(19) 




Europäisches Patentamt
European Patent Office
Office européen des brevets



(11) 



EP 0 974 111 B1 



(12) 



EUROPEAN PATENT SPECIFICATION 



(45) Date of publication and mention
of the grant of the patent: 
08.01.2003 Bulletin 2003/02 

(21) Application number 98915478.6 

(22) Date of filing: 10.04.1998 



(51) Int Cl.7: G06F 17/50, C07K 1/00

(86) International application number: 
PCT/US98/07254 

(87) International publication number:

WO 98/047089 (22.10.1998 Gazette 1998/42) 



(54) APPARATUS AND METHOD FOR AUTOMATED PROTEIN DESIGN 

GERÄT UND VERFAHREN FÜR AUTOMATISCHEN PROTEIN-ENTWURF

DISPOSITIF ET METHODE PERMETTANT UNE MISE AU POINT INFORMATISEE DE PROTEINES 






(84) Designated Contracting States: 
BE CH DE DK FR GB LI

(30) Priority: 11.04.1997 US 43464 P 
04.08.1997 US 54678 P 
03.10.1997 US 61097 P 

(43) Date of publication of application: 
26.01.2000 Bulletin 2000/04 

(60) Divisional application: 
02015990.1 / 1 255 209 

(73) Proprietor: CALIFORNIA INSTITUTE OF TECHNOLOGY
Pasadena, California 91125 (US)

(72) Inventors: 

• MAYO, Stephen
Pasadena, CA 91107 (US) 

• DAHIYAT, Bassil, I. 

Los Angeles, CA 90024 (US) 



• GORDON, D., Benjamin 
Pasadena, CA 91106 (US) 

• STREET, Arthur

Los Angeles, CA 90027 (US) 

(74) Representative: Kiddle, Simon John et al 
Mewburn Ellis, 
York House,
23 Kingsway 
London WC2B 6HP (GB) 

(56) References cited: 

• DAHIYAT ET AL: "protein design automation" 
PROTEIN SCIENCE, vol. 5, no. 5, 1996, pages 
895-903, XP002073372 us cited in the application 

• DESMET ET AL: "theoretical and algorithmical
optimization of the dead-end elimination 
theorem" PROCEEDINGS OF THE PACIFIC 
SYMPOSIUM ON BIOCOMPUTING '97, 6 - 9 
January 1997, pages 122-133, XP002073373 us 



Note: Within nine months from the publication of the mention of the grant of the European patent, any person may give 
notice to the European Patent Office of opposition to the European patent granted. Notice of opposition shall be filed in 
a written reasoned statement. It shall not be deemed to have been filed until the opposition fee has been paid. (Art.
99(1) European Patent Convention). 



Printed by Jouve. 75001 PARIS (FR) 




Description 






[0001] This application is a continuing application of U.S. Serial Nos. 60/043,464, filed April 11, 1997, 60/054,678, filed August 4, 1997, and 60/061,097, filed October 3, 1997.

FIELD OF THE INVENTION 

[0002] The present invention relates to an apparatus and method for quantitative protein design and optimization.

BACKGROUND OF THE INVENTION

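[0003] De novo protein design has received considerable attention recently, and significant advances have been made toward the goal of producing stable, well-folded proteins with novel sequences. Efforts to design proteins rely on knowledge of the physical properties that determine protein structure, such as the patterns of hydrophobic and hydrophilic residues in the sequence, salt bridges and hydrogen bonds, and secondary structural preferences of amino acids. Various approaches to apply these principles have been attempted. For example, the construction of α-helical and β-sheet proteins with native-like sequences was attempted by individually selecting the residue required at every position in the target fold (Hecht, et al., Science 249:884-891 (1990); Quinn, et al., Proc. Natl. Acad. Sci. USA 91:8747-8751 (1994)). Alternatively, a minimalist approach was used to design helical proteins, where the simplest possible sequence believed to be consistent with the folded structure was generated (Regan, et al., Science 241:976-978 (1988); DeGrado, et al., Science 243:622-628 (1989); Handel, et al., Science 261:879-885 (1993)), with varying degrees of success. An experimental method that relies on the hydrophobic and polar (HP) pattern of a sequence was developed where a library of sequences with the correct pattern for a four helix bundle was generated by random mutagenesis (Kamtekar, et al., Science 262:1680-1685 (1993)). Domains of naturally occurring proteins have also been modified or coupled together [...].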

[0005] [...] systematic methods to protein design ([...]; Hellinga, et al., J. Mol. Biol. 222:763-785 (1991); Hurley, et al., J. Mol. Biol. 224:1143-1154 (1992); [...] 34:11645-11651 (1995); Betz, et al., Biochemistry 35:[...] (1996); Dahiyat & Mayo, Protein Science 5:895-903 (1996); Jones, Protein Science 3:567-574 (1994); [...]). These algorithms consider the spatial positioning and steric complementarity of side chains by explicitly modeling the atoms of sequences under consideration. To date, such techniques have typically focused on designing the cores of proteins and have scored sequences with van der Waals and sometimes hydrophobic solvation potentials.

[0006] The document DAHIYAT ET AL: "protein design automation", PROTEIN SCIENCE (US), vol. 5, no. 5, 1996, pages 895-903 discloses a method comprising providing a protein backbone structure with variable residue positions, for each of which it establishes a group of potential rotamers, wherein at least one variable residue position has rotamers from at least two different amino acid side chains. The method then analyses the interaction of each of the rotamers with the remainder of the backbone to generate a set of optimised protein sequences, wherein the analysing step includes the use of the van der Waals and solvation scoring functions.

[0007] In addition, the qualitative nature of [...] second generation proteins because there are no [...].

[0008] [...]

SUMMARY OF THE INVENTION 

[0009] The invention is set out in appended method claim 1 and computer medium claim 15.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] Figure 1 illustrates a general purpose computer configured in accordance with an embodiment of the invention.

[0011] Figure 2 illustrates processing steps associated with an embodiment of the invention.

[0012] Figure 3 illustrates processing steps associated with a ranking module used in accordance with an embodiment of the invention. After any DEE step, any one of the previous DEE steps may be repeated. In addition, any one of the DEE steps may be eliminated; for example, original singles DEE (step 74) need not be run.

[0013] Figure 4 depicts the protein design automation cycle. 

[0014] Figure 5 depicts the helical wheel diagram of a coiled coil. One heptad repeat is shown viewed down the major axes of the helices. The a and d positions define the solvent-inaccessible core of the molecule (Cohen & Parry, 1990, Proteins, Structure, Function and Genetics 7:1-15).

[0015] Figures 6A and 6B depict the comparison of simulation cost functions to experimental Tm's. Figure 6A depicts the initial cost function, which contains only a van der Waals term for the eight PDA peptides. Figure 6B depicts the improved cost function containing polar and nonpolar surface area terms weighted by atomic solvation parameters derived from QSAR analysis; 16 cal/mol/Å² favors hydrophobic surface burial.

[0016] Figure 7 shows the rank correlation of energy predicted by the simulation module versus the combined activity score of λ repressor mutants (Lim, et al., J. Mol. Biol. 219:359-376 (1991); Hellinga, et al., Proc. Natl. Acad. Sci. USA 91:5803-5807 (1994)).

[0017] Figure 8 shows the sequence of pda8d aligned with the second zinc finger of Zif268. The boxed positions were designed using the sequence selection algorithm. The coordinates of PDB record 1zaa (Pavletich, et al., Science 252:809-817 (1991)) from residues 33-60 were used as the structure template. In our numbering, position 1 corresponds to 1zaa position 33.

[0018] Figures 9A and 9B show the NMR spectra and solution secondary structure of pda8d from Example 3. Figure 9A is the TOCSY Hα-HN fingerprint region of pda8d. Figure 9B is the NMR NOE connectivities of pda8d. Bars represent unambiguous connectivities and the bar thickness of the sequential connections is indexed to the intensity of the resonance.

[0019] Figures 10A and 10B depict the secondary structure content and thermal stability of α90, α85, α70 and α107. Figure 10A depicts the far UV spectra (circular dichroism). Figure 10B depicts the thermal denaturation monitored by CD.

[0020] Figure 11 depicts the sequence of FSD-1 of Example 5 aligned with the second zinc finger of Zif268. The bar at the top of the figure shows the residue position classifications: solid bars indicate core positions, hatched bars indicate boundary positions and open bars indicate surface positions. The alignment matches positions of FSD-1 to the corresponding backbone template positions of Zif268. Of the six identical positions (21%) between FSD-1 and Zif268, four are buried (Ile7, Phe12, Leu18 and Ile22). The zinc binding residues of Zif268 are boxed. Representative non-optimal sequence solutions determined using a Monte Carlo simulated annealing protocol are shown with their rank. Vertical lines indicate identity with FSD-1. The symbols at the bottom of the figure show the degree of sequence conservation for each residue position computed across the top 1000 sequences: filled circles indicate greater than 99% conservation, half-filled circles indicate conservation between 90 and 99%, open circles indicate conservation between 50 and 90%, and the absence of a symbol indicates less than 50% conservation. The consensus sequence determined by choosing the amino acid with the highest occurrence at each position is identical to the sequence of FSD-1.

[0021] Figure 12 is a schematic representation of the minimum and maximum quantities (defined in Eq. 24 to 27) that are used to construct speed enhancements. The minima and maxima are utilized directly to find the [...] pair and for the comparison of extrema. The differences between the quantities, denoted with arrows, are used to construct the [...] and [...] metrics.

[0022] Figures 13A, 13B, 13C, 13D, 13E and 13F depict the areas involved in calculating the buried and exposed areas of Equations 18 and 19. The dashed box is the protein template, the heavy solid lines correspond to three rotamers at three different residue positions, and the lighter solid lines correspond to surface areas. a) [...] for each rotamer. b) [...] for each rotamer. c) ([...] - [...]) summed over the three residues. The upper residue does not bury any area against the template except that buried in the tri-peptide state [...]. d) [...] for one pair of rotamers. e) The area buried between rotamers, ([...]), for the same pair of rotamers as in (d). f) The area buried between rotamers, ([...]), summed over the three pairs of rotamers. The area intersected by all three rotamers is counted twice and is indicated by the double lines. The buried area calculated by Equation 18 is the area buried by the template, represented in (c), plus s times the area buried between rotamers, represented in (f). The scaling factor s accounts for the over-counting shown by the double lines in (f). The exposed area calculated by Equation 19 is the exposed area in the presence of the template, represented in (b), minus s times the area buried between rotamers, represented in (f).

DETAILED DESCRIPTION OF THE INVENTION

[0023] The present invention is directed to the quantitative design and optimization of amino acid sequences, using an "inverse protein folding" approach, which seeks the optimal sequence for a desired structure. Inverse folding can be contrasted with a "protein folding" approach, which attempts to predict a structure taken by a given sequence.

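[0024] In general, the method of the invention is as follows. A protein backbone structure is used as input. Positions to be varied are then identified, which may be the entire sequence or subsets thereof. The side chains of any positions to be varied are then removed. The resulting structure, consisting of the protein backbone and the remaining side chains, is called the template. Each variable residue position is then preferably classified as a core residue, a surface residue, or a boundary residue; each classification defines a subset of possible amino acid residues for the position (for example, core residues generally will be selected from the set of hydrophobic residues, surface residues generally will be selected from the hydrophilic residues, and boundary residues may be either). Each amino acid can be represented by a discrete set of all allowed conformers of each side chain, called rotamers. Thus, to arrive at an optimal sequence for a backbone, all possible sequences of rotamers must be screened, where each backbone position can be occupied either by each amino acid in all its possible rotameric states, or a subset of amino acids, and thus a subset of rotamers.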

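[0025] Two sets of interactions are then calculated for each rotamer at every position: the interaction of the rotamer side chain with all or part of the backbone (the "singles" energy, also called the rotamer/template or rotamer/backbone energy), and the interaction of the rotamer side chain with all other possible rotamers at every other position or a subset of the other positions (the "doubles" energy, also called the rotamer/rotamer energy). The energy of each of these interactions is calculated through the use of a variety of scoring functions, which include the energy of van der Waals forces, the energy of hydrogen bonding, the energy of secondary structure propensity, the energy of surface area solvation and the electrostatics. Thus, the total energy of each rotamer interaction, both with the backbone and other rotamers, is calculated, and stored in a matrix form.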
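
As a rough illustration of how stored singles and doubles energies combine into the energy of one candidate rotamer sequence, the following minimal sketch may help. The function name, the data layout and the numeric values are assumptions made for this example only; they are not taken from the specification.

```python
from itertools import combinations

def sequence_energy(choice, singles, doubles):
    """Total energy of one rotamer assignment (illustrative layout).

    choice  : choice[i] is the rotamer index picked at position i
    singles : singles[i][r] is the rotamer/template energy
    doubles : doubles[(i, j)][r][s] is the rotamer/rotamer energy, i < j
    """
    # Sum of the "singles" terms for the chosen rotamers.
    total = sum(singles[i][r] for i, r in enumerate(choice))
    # Add one "doubles" term for every pair of positions.
    for i, j in combinations(range(len(choice)), 2):
        total += doubles[(i, j)][choice[i]][choice[j]]
    return total

# Toy data: two positions, two rotamers each (made-up numbers).
singles = [[-1.0, 0.5], [0.2, -0.3]]
doubles = {(0, 1): [[0.1, -0.4], [0.0, 2.0]]}

# Exhaustive search is only feasible for tiny examples like this one.
best = min(
    ((a, b) for a in range(2) for b in range(2)),
    key=lambda c: sequence_energy(list(c), singles, doubles),
)
```

Because every term is precomputed, scoring a sequence needs only table lookups, which is what makes the elimination and search steps that follow tractable.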

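[0026] The discrete nature of rotamer sets allows a simple calculation of the number of rotamer sequences to be tested. A backbone of length n with m possible rotamers per position will have m^n possible rotamer sequences, a number which grows exponentially with sequence length and renders the calculations either unwieldy or impossible in real time. Accordingly, to solve this combinatorial search problem, a "Dead End Elimination" (DEE) calculation is performed. The DEE calculation is based on the fact that if the worst total interaction of a first rotamer is still better than the best total interaction of a second rotamer, then the second rotamer cannot be part of the global optimum solution. Since the energies of all rotamers have already been calculated, the DEE approach only requires sums over the sequence length to test and eliminate rotamers, which speeds up the calculations considerably. DEE can be rerun comparing pairs of rotamers, or combinations of rotamers, which will eventually result in the determination of a single sequence which represents the global optimum energy.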
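
The singles elimination test described in this paragraph can be sketched as follows. This is a minimal single pass under assumed data structures (the same hypothetical `singles` and `doubles` layout: per-position lists and a dictionary keyed by position pairs); it is not the ranking module of the specification, and it omits the pairs and speed-enhanced variants.

```python
def dee_singles(singles, doubles):
    """Return, per position, the rotamer indices that survive one DEE pass."""
    n = len(singles)

    def pair(i, r, j, s):
        # doubles is stored once per unordered pair of positions, with i < j
        return doubles[(i, j)][r][s] if i < j else doubles[(j, i)][s][r]

    def bound(i, r, pick):
        # Best (pick=min) or worst (pick=max) total interaction of rotamer r at i.
        return singles[i][r] + sum(
            pick(pair(i, r, j, s) for s in range(len(singles[j])))
            for j in range(n) if j != i
        )

    survivors = []
    for i in range(n):
        rots = range(len(singles[i]))
        worst = {t: bound(i, t, max) for t in rots}
        # Rotamer r is a dead end if its best case is still worse than the
        # worst case of some other rotamer t at the same position.
        keep = [r for r in rots
                if not any(bound(i, r, min) > worst[t] for t in rots if t != r)]
        survivors.append(keep)
    return survivors
```

In practice the pass would be repeated on the reduced rotamer sets until no further rotamers are eliminated.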

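[0027] Once the global solution has been found, a Monte Carlo search may be done to generate a rank-ordered list of sequences in the neighborhood of the DEE solution. Starting at the DEE solution, random positions are changed to other rotamers, and the new sequence energy is calculated. If the new sequence meets the criteria for acceptance, it is used as a starting point for another jump. After a predetermined number of jumps, a rank-ordered list of sequences is generated.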
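
A minimal sketch of such a search is given below. The text does not state the acceptance criteria at this point, so a Metropolis test at a fixed temperature is assumed here purely for illustration; the function name, parameters and the `energy` callback are likewise hypothetical.

```python
import math
import random

def monte_carlo(start, n_rotamers, energy, jumps=1000, temperature=1.0, seed=0):
    """Random-walk search around `start`; returns sequences ranked by energy."""
    rng = random.Random(seed)
    current, e_cur = list(start), energy(start)
    seen = {tuple(current): e_cur}
    for _ in range(jumps):
        trial = list(current)
        pos = rng.randrange(len(trial))               # pick a random position
        trial[pos] = rng.randrange(n_rotamers[pos])   # change it to another rotamer
        e_new = energy(trial)
        # Assumed acceptance rule (Metropolis): always accept improvements,
        # accept worse sequences with Boltzmann probability.
        if e_new <= e_cur or rng.random() < math.exp((e_cur - e_new) / temperature):
            current, e_cur = trial, e_new
            seen[tuple(current)] = e_cur
    # Rank-ordered list of every accepted sequence, lowest energy first.
    return sorted(seen.items(), key=lambda kv: kv[1])
```

The starting sequence would normally be the DEE solution, and `energy` would sum the stored singles and doubles terms for the trial sequence.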

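[0028] The results may then be experimentally verified by physically generating one or more of the protein sequences followed by experimental testing. The information obtained from the testing can then be fed back into the analysis, to modify the procedure if necessary.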

[0029] Thus, the present invention provides a computer-assisted method of designing a protein. The method comprises providing a protein backbone structure with variable residue positions, and then establishing a group of potential rotamers for each of the residue positions. As used herein, the backbone, or template, includes the backbone atoms and any fixed side chains. The interactions between the protein backbone and the potential rotamers, and between pairs of the potential rotamers, are then processed to generate a set of optimized protein sequences [...].

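[0030] Fig. 1 illustrates an automated protein design apparatus 20 in accordance with an embodiment of the invention. The apparatus 20 includes a central processing unit 22 which communicates with a memory 24 and a set of input/output devices (e.g., keyboard, mouse, monitor, printer, etc.) 26 through a bus 28. The general interaction between a central processing unit 22, a memory 24, input/output devices 26, and a bus 28 is known in the art. The present invention is directed toward the automated protein design program 30 stored in the memory 24.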

[0031] The automated protein design program 30 may be implemented with a side chain module 32. As discussed in detail below, the side chain module establishes a group of potential rotamers for a selected protein backbone structure. The protein design program 30 may also be implemented with a ranking module 34. As discussed in detail below, the ranking module 34 analyzes the interaction of rotamers with the protein backbone structure to generate optimized protein sequences. The protein design program 30 may also include a search module 36 to execute a search, for example a Monte Carlo search as described below, of the protein sequences generated by the ranking module 34. Finally, an assessment module 38 may also be used to assess physical parameters associated with the derived proteins, as discussed further below.

[0032] The memory 24 also stores a protein backbone structure 40, which is downloaded by a user through the input/output devices 26. The memory 24 also stores information on potential rotamers derived by the side chain module 32. In addition, the memory 24 stores protein sequences 44 generated by the ranking module 34. The protein sequences 44 may be passed as output to the input/output devices 26.

[0033] The operation of the automated protein design apparatus 20 is more fully appreciated with reference to Fig. 2. Fig. 2 illustrates processing steps executed in accordance with the method of the invention. As described below, many of the processing steps are executed by the protein design program 30. The first processing step illustrated in Fig. 2 is to provide a protein backbone structure (step 50). As previously indicated, the protein backbone structure is downloaded through the input/output devices 26 using standard techniques.

[0034] The protein backbone structure corresponds to a selected protein. By "protein" herein is meant at least two amino acids linked together by a peptide bond. As used herein, protein includes proteins, oligopeptides and peptides. The peptidyl group may comprise naturally occurring amino acids and peptide bonds, or synthetic peptidomimetic structures, i.e. "analogs", such as peptoids (see Simon et al., PNAS USA 89(20):9367 (1992)). The amino acids may either be naturally occurring or non-naturally occurring; as will be appreciated by those in the art, any structure for which a set of rotamers is known or can be generated can be used as an amino acid. The side chains may be in either the (R) or the (S) configuration. In a preferred embodiment, the amino acids are in the (S) or L-configuration.

[0035] The chosen protein may be any protein for which a three dimensional structure is known or can be generated; that is, for which there are three dimensional coordinates for each atom of the protein. Generally this can be determined using X-ray crystallographic techniques, NMR techniques, de novo modelling, homology modelling, etc. In general, if X-ray structures are used, structures at 2 Å resolution or better are preferred, but not required.

[0036] The proteins may be from any organism, including prokaryotes and eukaryotes, with enzymes from bacteria, fungi, extremophiles such as the archaebacteria, insects, fish, animals (particularly mammals and particularly human) and birds all possible.

[0037] Suitable proteins include, but are not limited to, industrial and pharmaceutical proteins, including ligands, cell surface receptors, antigens, antibodies, cytokines, hormones, and enzymes. Suitable classes of enzymes include, but are not limited to, hydrolases such as proteases, carbohydrases, lipases; isomerases such as racemases, epimerases, tautomerases, or mutases; transferases, kinases, oxidoreductases, and phosphatases. Suitable enzymes are listed in the Swiss-Prot enzyme database.

[0038] Suitable protein backbones include, but are not limited to, all of those found in the protein data base compiled and serviced by the Brookhaven National Lab.

[0039] Specifically included within "protein" are fragments and domains of known proteins, including functional domains such as enzymatic domains, binding domains, etc., and smaller fragments, such as turns, loops, etc. That is, portions of proteins may be used as well.

[0040] Once the protein is chosen, the protein backbone structure is input into the computer. By "protein backbone structure" or grammatical equivalents herein is meant the three dimensional coordinates that define the three dimensional structure of a particular protein. The structures which comprise a protein backbone structure (of a naturally occurring protein) are the nitrogen, the carbonyl carbon, the α-carbon, and the carbonyl oxygen, along with the direction of the vector from the α-carbon to the β-carbon.

[0041] The protein backbone structure which is input into the computer can either include the coordinates for both the backbone and the amino acid side chains, or just the backbone, i.e. with the coordinates for the amino acid side chains removed. If the former is done, the side chain atoms of each amino acid of the protein structure may be "stripped" or removed from the structure of a protein, as is known in the art, leaving only the coordinates for the "backbone" atoms (the nitrogen, carbonyl carbon and oxygen, and the α-carbon, and the hydrogens attached to the nitrogen and α-carbon).

[0042] After inputting the protein structure backbone, explicit hydrogens are added if not included within the structure (for example, if the structure was generated by X-ray crystallography, hydrogens must be added). After hydrogen addition, energy minimization of the structure is run, to relax the hydrogens as well as the other atoms, bond angles and bond lengths. In a preferred embodiment, this is done by doing a number of steps of conjugate gradient minimization (Mayo et al., J. Phys. Chem. 94:8897 (1990)) of atomic coordinate positions to minimize the Dreiding force field with no electrostatics. Generally from about 10 to about 250 steps is preferred, with about 50 being most preferred.

[0043] The protein backbone structure contains at least one variable residue position. As is known in the art, the residues, or amino acids, of proteins are generally sequentially numbered starting with the N-terminus of the protein. Thus a protein having a methionine at its N-terminus is said to have a methionine at residue or amino acid position 1, with the next residues as 2, 3, 4, etc. At each position, the wild type (i.e. naturally occurring) protein may have one of at least 20 amino acids, in any number of rotamers. By "variable residue position" herein is meant an amino acid position of the protein to be designed that is not fixed in the design method as a specific residue or rotamer, generally the wild-type residue or rotamer.






[0044] In a preferred embodiment, all of the residue positions of the protein are variable. [...] a particular amino acid. Alternatively, the methods of the present invention [...] important for biological activity, such as the residues which form the active site of an enzyme, [...] it is possible to fix residues as non-wild type amino acids as well.

[0051] [...] In one embodiment of the invention, the side chain module 32 includes at least one rotamer library [...] the selected protein backbone structure [...]. Alternatively, the side chain module 32 may be omitted and the potential rotamers for the selected protein backbone structure may be downloaded through the input/output devices 26.

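[0052] As is known in the art, each amino acid side chain has a set of possible conformers, called rotamers (see [...]; Dunbrack, et al., Struc. Biol. 1(5):334-340 (1994); Desmet, et al., Nature 356:539-542 (1992)). Thus, a set of discrete rotamers for every amino acid side chain is used. There are two general types of rotamer libraries: backbone dependent and backbone independent. A backbone dependent rotamer library allows different rotamers depending on the position of the residue in the backbone; thus for example, certain leucine rotamers are allowed if the position is within an α-helix, and different leucine rotamers are allowed if the position is not in an α-helix. A backbone independent rotamer library utilizes all rotamers of an amino acid at every position. In general, a backbone independent library is preferred in the consideration of core residues [...]. However, backbone independent libraries are computationally more expensive [...].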

[0053] In addition, a preferred embodiment does a type of "fine tuning" of the rotamer library by expanding the possible χ (chi) angle values of the rotamers by plus and minus one standard deviation (or more) about the mean value, in order to minimize possible errors that might arise from the discreteness of the library. This is particularly important for aromatic residues, and fairly important for hydrophobic residues, due to the increased requirements for flexibility in the core and the rigidity of aromatic rings; it is not as important for the other residues. Thus a preferred embodiment expands the χ1 and χ2 angles for all amino acids except Met, Arg and Lys.
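
The expansion of χ angles about their mean values can be sketched as follows. The function and argument names are hypothetical, and the angle and standard deviation values used in any call are the caller's own; the sketch only shows how one library rotamer becomes several when χ1 and χ2 are each varied by one standard deviation.

```python
from itertools import product

def expand_rotamer(chi_means, chi_sds, expand=(True, True)):
    """Expand one library rotamer by +/- one standard deviation.

    chi_means : mean chi angles of the rotamer, in degrees
    chi_sds   : standard deviations of those angles, in degrees
    expand    : which chi angles to expand (default: chi1 and chi2 only)
    """
    options = []
    for k, (mean, sd) in enumerate(zip(chi_means, chi_sds)):
        if k < len(expand) and expand[k]:
            options.append((mean - sd, mean, mean + sd))
        else:
            options.append((mean,))  # angle left at its library value
    # Every combination of the per-angle options is a new rotamer.
    return [tuple(combo) for combo in product(*options)]
```

With both χ1 and χ2 expanded, each library rotamer yields nine conformers, which is the price paid for reducing discretization error.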

[0054] To roughly illustrate the numbers of rotamers, in one version of the Dunbrack & Karplus backbone-dependent rotamer library, alanine has 1 rotamer, glycine has 1 rotamer, arginine has 55 rotamers, threonine has 9 rotamers, lysine has 57 rotamers, glutamic acid has 69 rotamers, asparagine has 54 rotamers, aspartic acid has 27 rotamers, tryptophan has 54 rotamers, tyrosine has 36 rotamers, cysteine has 9 rotamers, glutamine has 69 rotamers, histidine has 54 rotamers, valine has 9 rotamers, isoleucine has 45 rotamers, leucine has 36 rotamers, methionine has 21 rotamers, serine has 9 rotamers, and phenylalanine has 36 rotamers.
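
To give a feel for the size of the search space these counts imply, the following sketch tabulates them and multiplies them out over a number of variable positions. The dictionary simply restates the counts quoted above; the helper names are assumptions for this example.

```python
# Rotamer counts quoted in the text for one version of the
# Dunbrack & Karplus backbone-dependent library (proline is not listed).
ROTAMER_COUNTS = {
    "Ala": 1, "Gly": 1, "Arg": 55, "Thr": 9, "Lys": 57, "Glu": 69,
    "Asn": 54, "Asp": 27, "Trp": 54, "Tyr": 36, "Cys": 9, "Gln": 69,
    "His": 54, "Val": 9, "Ile": 45, "Leu": 36, "Met": 21, "Ser": 9,
    "Phe": 36,
}

def rotamers_per_position(allowed):
    """Number of rotamers at one position for a set of allowed amino acids."""
    return sum(ROTAMER_COUNTS[aa] for aa in allowed)

def rotamer_sequences(allowed, n_positions):
    """Number of rotamer sequences if every position allows the same set."""
    return rotamers_per_position(allowed) ** n_positions

# Example set: every listed amino acid except cysteine and glycine.
ALL_BUT_CYS_GLY = [aa for aa in ROTAMER_COUNTS if aa not in ("Cys", "Gly")]
```

Even ten fully variable positions give a number of rotamer sequences far beyond exhaustive enumeration, which is why restricting the allowed amino acids per position matters.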

[0055] Proline is generally not used, since it will rarely be chosen for any position, although it can be included if desired. Similarly, a preferred embodiment omits cysteine as a consideration, only to avoid potential disulfide problems, although it can be included if desired.

[0056] As will be appreciated by those in the art, other rotamer libraries with all dihedral angles staggered can be used or generated.

[0057] In a preferred embodiment, at a minimum, at least one variable position has rotamers from at least two different amino acid side chains; that is, a sequence is being optimized, rather than a structure.

[0058] In a preferred embodiment, rotamers from all of the amino acids (or all of them except cysteine, glycine and proline) are used for each variable residue position; that is, the group or set of potential rotamers at each variable position is every possible rotamer of each amino acid. This is especially preferred when the number of variable positions is not high, as this type of analysis can be computationally expensive.

[0059] In a preferred embodiment, each variable position is classified as either a core, surface or boundary residue position, although in some cases, as explained below, the variable position may be set to glycine to minimize backbone strain.

[0060] It should be understood that quantitative protein design or optimization studies prior to the present invention focused almost exclusively on core residues. The present invention, however, provides methods for designing proteins containing core, surface and boundary positions. Alternate embodiments utilize methods for designing proteins containing core and surface residues, core and boundary residues, and surface and boundary residues, as well as core residues alone (using the scoring functions of the present invention), surface residues alone, or boundary residues alone.

[0061] The classification of residue positions as core, surface or boundary may be done in several ways, as will be appreciated by those in the art. In a preferred embodiment, the classification is done via a visual scan of the original protein backbone structure, including the side chains, and assigning a classification based on a subjective evaluation of one skilled in the art of protein modelling. Alternatively, a preferred embodiment utilizes an assessment of the orientation of the Cα-Cβ vectors relative to a solvent accessible surface computed using only the template Cα atoms. In a preferred embodiment, the solvent accessible surface for only the Cα atoms of the target fold is generated using the Connolly algorithm with a probe radius ranging from about 4 to about 12 Å, with from about 6 to about 10 Å being preferred, and 8 Å being particularly preferred. The Cα radius used ranges from about 1.6 Å to about 2.3 Å, with from about 1.8 to about 2.1 Å being preferred, and 1.95 Å being especially preferred. A residue is classified as a core position if a) the distance for its Cα, along its Cα-Cβ vector, to the solvent accessible surface is greater than about 4-6 Å, with greater than about 5.0 Å being especially preferred, and b) the distance for its Cβ to the nearest surface point is greater than about 1.5-3 Å, with greater than about 2.0 Å being especially preferred. The remaining residues are classified as surface positions if the sum of the distances from their Cα, along their Cα-Cβ vector, to the solvent accessible surface, plus the distance from their Cβ to the closest surface point is less than about 2.5-4 Å, with less than about 2.7 Å being especially preferred. All remaining residues are classified as boundary positions.
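
The distance-based classification rule can be sketched as a small function. It assumes the two distances (Cα to the surface along the Cα-Cβ vector, and Cβ to the nearest surface point) have already been measured against a Cα-only solvent accessible surface; computing that surface is outside this sketch. The default thresholds are the especially preferred values quoted above, and the function and parameter names are hypothetical.

```python
def classify_position(ca_to_surface, cb_to_surface,
                      core_ca=5.0, core_cb=2.0, surface_sum=2.7):
    """Classify a residue position as 'core', 'surface' or 'boundary'.

    ca_to_surface : distance (in angstroms) from C-alpha, along the
                    C-alpha/C-beta vector, to the solvent accessible surface
    cb_to_surface : distance (in angstroms) from C-beta to the nearest
                    surface point
    """
    # Core: both distances exceed their thresholds.
    if ca_to_surface > core_ca and cb_to_surface > core_cb:
        return "core"
    # Surface: the summed distances fall below the surface threshold.
    if ca_to_surface + cb_to_surface < surface_sum:
        return "surface"
    # Everything else is a boundary position.
    return "boundary"
```

Exposing the thresholds as parameters reflects that the text gives ranges rather than single mandatory values.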

[0062] Once each variable position is classified as either core, surface or boundary, a set of amino acid side chains, and thus a set of rotamers, is assigned to each position. That is, the set of possible amino acid side chains that the program will allow to be considered at any particular position is chosen. Subsequently, once the possible amino acid side chains are chosen, the set of rotamers that will be evaluated at a particular position can be determined. Thus, a core residue will generally be selected from the group of hydrophobic residues consisting of alanine, valine, isoleucine, leucine, phenylalanine, tyrosine, tryptophan, and methionine (in some embodiments, when the α scaling factor of the van der Waals scoring function, described below, is low, methionine is removed from the set), and the rotamer set for each core position potentially includes rotamers for these eight amino acid side chains (all the rotamers if a backbone independent library is used, and subsets if a backbone dependent library is used). Similarly, surface positions are generally selected from the group of hydrophilic residues consisting of alanine, serine, threonine, aspartic acid, asparagine, glutamine, glutamic acid, arginine, lysine and histidine. The rotamer set for each surface position thus includes rotamers for these ten residues. Finally, boundary positions are generally chosen from alanine, serine, threonine, aspartic acid, asparagine, glutamine, glutamic acid, arginine, lysine, histidine, valine, isoleucine, leucine, phenylalanine, tyrosine, tryptophan, and methionine. The rotamer set for each boundary position thus potentially includes every rotamer for these seventeen residues (assuming cysteine, glycine and proline are not used).
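
The assignment of allowed amino acids by position class can be written down directly. The sketch below uses three-letter codes and a hypothetical helper name; it also makes explicit that the boundary set is the union of the core and surface sets, which accounts for the seventeen residues mentioned above (alanine appears in both).

```python
CORE = {"Ala", "Val", "Ile", "Leu", "Phe", "Tyr", "Trp", "Met"}
SURFACE = {"Ala", "Ser", "Thr", "Asp", "Asn", "Gln", "Glu", "Arg", "Lys", "His"}
# Boundary positions may take residues from either set.
BOUNDARY = CORE | SURFACE

ALLOWED = {"core": CORE, "surface": SURFACE, "boundary": BOUNDARY}

def allowed_residues(position_class, drop_met=False):
    """Amino acids considered at a position of the given class.

    drop_met models the embodiment in which methionine is removed from
    the core set when the van der Waals scaling factor is low.
    """
    allowed = set(ALLOWED[position_class])
    if drop_met and position_class == "core":
        allowed.discard("Met")
    return allowed
```

Restricting each position to its class-specific set is what reduces the number of rotamers, and hence calculations, per position.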

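[0063] Thus, as will be appreciated by those in the art, there is a computational benefit to classifying the residue positions, as it decreases the number of calculations. It should also be noted that there may be situations where the sets of core, boundary and surface residues are altered from those described above; for example, under some circumstances, one or more amino acids is either added or subtracted from the set of allowed amino acids. For example, some proteins which dimerize or multimerize, or have ligand binding sites, may contain hydrophobic surface residues, etc. In addition, residues that do not allow helix "capping" or the favorable interaction with an α-helix dipole may be subtracted from a set of allowed residues. This modification of amino acid groups is done on a residue by residue basis.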
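[0064] In a preferred embodiment, proline, cysteine and glycine are not included in the list of possible amino acid side chains, and thus the rotamers for these side chains are not used. [...]

[0065] [...] The ranking module 34 may be used to perform these operations. That is, computer code is written to implement the following functions. [...] as is generally outlined above, the processing initially comprises the use of a number of scoring functions [...].

[0066] The scoring functions include a van der Waals potential scoring function, a hydrogen bond potential scoring function, an atomic solvation scoring function, a secondary structure propensity scoring function and an electrostatic scoring function. As is further described below, at least one scoring function is used to score each position, although the scoring functions may differ depending on the position classification or other considerations, like favorable interaction with an α-helix dipole. As outlined below, the total energy which is used in the calculations is the sum of the energy of each scoring function used at a particular position, as is generally shown in Equation 1: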

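Equation 1

$$E_{total} = nE_{vdw} + nE_{as} + nE_{h\text{-}bonding} + nE_{ss} + nE_{elec}$$

[0067] In Equation 1, the total energy is the sum of the energy of the van der Waals potential ($E_{vdw}$), the energy of atomic solvation ($E_{as}$), the energy of hydrogen bonding ($E_{h\text{-}bonding}$), the energy of secondary structure ($E_{ss}$) and the energy of electrostatic interaction ($E_{elec}$). The term n is either 0 or 1, depending on whether the term is to be considered for the particular residue position [...].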

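[0068] In a preferred embodiment, a van der Waals scoring function is used. As is known in the art, van der Waals forces are the weak, non-covalent and non-ionic forces between atoms and molecules, that is, the induced dipole and electron repulsion (Pauli principle) forces.

[0069] The van der Waals scoring function is based on a van der Waals potential energy. There are a number of van der Waals potential energy calculations, including a Lennard-Jones 12-6 potential with radii and well depth parameters from the Dreiding force field, Mayo et al., J. Phys. Chem. 94:8897 (1990), or the exponential 6 potential. Equation 2, shown below, is the preferred Lennard-Jones potential: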



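Equation 2

$$E_{vdw} = D_0\left[\left(\frac{R_0}{R}\right)^{12} - 2\left(\frac{R_0}{R}\right)^{6}\right]$$

[0070] $R_0$ is the geometric mean of the van der Waals radii of the two atoms under consideration, and $D_0$ is the geometric mean of the well depth of the two atoms under consideration. $E_{vdw}$ and $R$ are the energy and interatomic distance between the two atoms under consideration.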



Equation 3

$$E_{vdw} = D_0\left[\left(\frac{\alpha R_0}{R}\right)^{12} - 2\left(\frac{\alpha R_0}{R}\right)^{6}\right]$$
[0072] The role of the α scaling factor is to change the importance of packing effects in the optimization and design of any particular protein. As discussed in the Examples, different values for α result in different sequences being generated by the present methods. Specifically, a reduced van der Waals steric constraint can compensate for the restrictive effect of a fixed backbone and discrete side-chain rotamers in the simulation and can allow a broader sampling of sequences compatible with a desired fold. In a preferred embodiment, α values ranging from about 0.70 to about 1.10 can be used, with α values from about 0.8 to about 1.05 being preferred, and from about 0.85 to about 1.0 being especially preferred. Specific α values which are preferred are 0.80, 0.85, 0.90, 0.95, 1.00, and 1.05.
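
The effect of the α scaling factor can be illustrated with a short numerical sketch. It assumes the standard 12-6 Lennard-Jones form in which the scaled radius α·R0 marks the energy minimum of depth D0; the function name and the sample parameter values are illustrative and are not Dreiding parameters taken from the text.

```python
def lj_scaled(r, r0, d0, alpha=1.0):
    """Scaled 12-6 Lennard-Jones energy for two atoms at distance r.

    r0    : geometric mean of the two van der Waals radii
    d0    : geometric mean of the two well depths
    alpha : van der Waals scale factor (alpha = 1 gives the unscaled form)
    """
    x = (alpha * r0 / r) ** 6
    # x * x is the repulsive r^-12 term, 2 * x the attractive r^-6 term.
    return d0 * (x * x - 2.0 * x)
```

Lowering α shrinks the effective radii, so a close contact that would be strongly penalized at α = 1.0 costs far less at α = 0.9, which is how a reduced steric constraint admits a broader range of sequences.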
[0073] Generally speaking, variation of the van der Waals scale factor α results in four regimes of packing specificity: regime 1 where 0.9 ≤ α ≤ 1.05 and packing constraints dominate the sequence selection; regime 2 where 0.8 ≤ α < 0.9 and the hydrophobic solvation potential begins to compete with packing forces; regime 3 where α < 0.8 and hydrophobic solvation dominates the design; and regime 4 where α > 1.05 and van der Waals repulsions appear to be too severe to allow meaningful sequence selection. In particular, different α values may be used for core, surface and boundary positions, with regimes 1 and 2 being preferred for core residues, regime 1 being preferred for surface residues, and regimes 1 and 2 being preferred for boundary residues.
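
The four regimes can be captured in a small lookup, shown here as a minimal sketch with a hypothetical function name; the boundaries are exactly those stated in the paragraph above.

```python
def packing_regime(alpha):
    """Map a van der Waals scale factor to its packing-specificity regime."""
    if alpha > 1.05:
        return 4  # repulsions too severe for meaningful sequence selection
    if alpha >= 0.9:
        return 1  # packing constraints dominate
    if alpha >= 0.8:
        return 2  # hydrophobic solvation begins to compete with packing
    return 3      # hydrophobic solvation dominates the design

# Regimes stated as preferred for each position class.
PREFERRED_REGIMES = {"core": {1, 2}, "surface": {1}, "boundary": {1, 2}}
```

A design run could use such a check to confirm that the α chosen for each position class falls in a preferred regime.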

[0074] In a preferred embodiment, the van der Waals scaling factor is used in the total energy calculations for each 
variable residue position, including core, surface and boundary positions. 

[0075] In a preferred embodiment, an atomic solvation potential scoring function is used. As is appreciated by those
in the art, solvent interactions of a protein are a significant factor in protein stability, and residue/protein
hydrophobicity has been shown to be the major driving force in protein folding. Thus, there is an entropic cost to
solvating hydrophobic surfaces, in addition to the potential for misfolding or aggregation. Accordingly, the burial of
hydrophobic surfaces within a protein structure is beneficial to both folding and stability. Similarly, there can be a
disadvantage for burying hydrophilic residues. The accessible surface area of a protein atom is generally defined as
the area of the surface over which a water molecule can be placed while making van der Waals contact with this atom
and not penetrating any other protein atom. Thus, in a preferred embodiment, the solvation potential is generally
scored by taking the total possible exposed surface area of the moiety or two independent moieties (either a rotamer
or the first rotamer and the second rotamer), which is the reference, and subtracting out the "buried" area, i.e. the
area which is not solvent exposed due to interactions either with the backbone or with other rotamers. This thus gives
the exposed surface area.
[0076] Alternatively, a preferred embodiment calculates the scoring function on the basis of the "buried" portion; i.e.
the total possible exposed surface area is calculated, and then the calculated surface area after the interaction of the
moieties is subtracted, leaving the buried surface area. A particularly preferred method does both of these calculations.
[0077] As is more fully described below, both of these methods can be done in a variety of ways. See Eisenberg et
al., Nature 319:199-203 (1986); Connolly, Science 221:709-713 (1983); and Wodak, et al., Proc. Natl. Acad. Sci. USA
77(4):1736-1740 (1980), all of which are expressly incorporated herein by reference. As will be appreciated by those
in the art, this solvation potential scoring function is conformation dependent, rather than conformation independent.
[0078] In a preferred embodiment, the pairwise solvation potential is implemented in two components, "singles"
(rotamer/template) and "doubles" (rotamer/rotamer), as is more fully described below. For the rotamer/template buried
area, the reference state is defined as the rotamer in question at residue position i with the backbone atoms only of
residues i-1, i and i+1, although in some instances just i may be used. Thus, in a preferred embodiment, the solvation
potential is not calculated for the interaction of each backbone atom with a particular rotamer, although more may be
done as required. The area of the side chain is calculated with the backbone atoms excluding solvent but not counted
in the area. The folded state is defined as the area of the rotamer in question at residue i, but now in the context of the
entire template structure including non-optimized side chains, i.e. every other fixed position residue. The rotamer/
template buried area is the difference between the reference and the folded states. The rotamer/rotamer reference
area can be done in two ways; one by using simply the sum of the areas of the isolated rotamers; the second includes
the full backbone. The folded state is the area of the two rotamers placed in their relative positions on the protein
scaffold but with no template atoms present. In a preferred embodiment, the Richards definition of solvent accessible
surface area (Lee and Richards, J. Mol. Biol. 55:379-400, 1971, hereby incorporated by reference) is used with a
probe radius ranging from 0.8 to 1.6 Å, with 1.4 Å being preferred, and Dreiding van der Waals radii, scaled from 0.8
to 1.0. Carbon and sulfur, and all attached hydrogens, are considered nonpolar. Nitrogen and oxygen, and all attached






the backbone). Since, as is generally outlined below, rotamers are only considered in pairs, that is, a first rotamer is
only compared to a second rotamer, this may overestimate the amount of buried surface area in locations where more
than two rotamers interact, that is, where rotamers from three or more residue positions come together. Thus, a
correction or scaling factor is used.
[0080] The general energy of solvation is shown in Equation 4:



Equation 4

\[ E_{sa} = f(SA) \]



Phobic buried surface area is used. EquaWs JrirTpS? " """^ ^ Mro- 

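Equation 5

\[ E_{sa} = f_1(SA_{\text{buried hydrophobic}}) \]

Equation 6

\[ E_{sa} = f_1(SA_{\text{buried hydrophobic}}) + f_2(SA_{\text{buried hydrophilic}}) \]

Equation 7

\[ E_{sa} = f_1(SA_{\text{buried hydrophobic}}) + f_3(SA_{\text{exposed hydrophobic}}) \]

Equation 8

\[ E_{sa} = f_1(SA_{\text{buried hydrophobic}}) + f_2(SA_{\text{buried hydrophilic}}) + f_3(SA_{\text{exposed hydrophobic}}) + f_4(SA_{\text{exposed hydrophilic}}) \]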
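
A surface-area energy of this kind can be sketched as a weighted sum. The sketch assumes a linear four-term form (buried hydrophobic, buried hydrophilic, exposed hydrophobic, exposed hydrophilic areas, each multiplied by a constant); the function name `solvation_energy` is hypothetical. The defaults for f1 and f2 are the -26 and 100 cal/mol/Å² values quoted in the text, converted to kcal/mol/Å²; the zero defaults for f3 and f4 are an assumption made only so that the simpler forms fall out as special cases.

```python
def solvation_energy(buried_hydrophobic,
                     buried_hydrophilic=0.0,
                     exposed_hydrophobic=0.0,
                     exposed_hydrophilic=0.0,
                     f1=-0.026, f2=0.100, f3=0.0, f4=0.0):
    """Surface-area solvation energy in kcal/mol (areas in square angstroms).

    Assumed linear form; with the default f3 = f4 = 0 only the buried
    areas contribute, and with f2 = 0 as well only buried hydrophobic
    area is scored.
    """
    return (f1 * buried_hydrophobic
            + f2 * buried_hydrophilic
            + f3 * exposed_hydrophobic
            + f4 * exposed_hydrophilic)
```

With these defaults, burying 100 Å² of hydrophobic surface is scored as about -2.6 kcal/mol, while burying 10 Å² of hydrophilic surface costs about 1.0 kcal/mol.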

[00811 In a preferred embodiment, = -f^ 

'^!^^"ST^S£^:'^:::Z'' '"-•'^ -^^ace .eas. and values of 23 

tirp^r^^ts^n^sr::^ ^ 

-26 cal/mol/A2 (f,) and 100 cal/mol/A^ (f^Mr^rtern!^ ""^^ *° overcounting. In this embodiment, values of 

Smr:r^:i^s;s^ 

a calculation for core and bouXy residueSe^^gXe^f "3 Z^^^^^^ ^"'^--e residues, with bolh 

I0085J In a preferred embodimert. a MrogeS ootentf 2^ ^ f ^ combination of the three is possible, 
used as predicted hydrogen bonds do «SZe SoS l,l"f Kr°" ^ ^^^^^^^^ '^'^ Po««"«al is 

(1992): Huyghues-Despointes ef af.. Biochem 3^13X^99'^ ^ ' ^to"- 226:1143 

reference,. As out«„ed pre.ous.y. ex^id, hyd;ogens are ZlTjlTZ^: S^e'*^ 




[0086] In a preferred embodiment, the hydrogen bond potential consists of a distance-dependent term and an
angle-dependent term, as shown in Equation 9:

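Equation 9

\[ E_{hb} = D_0 \left[ 5 \left( \frac{R_0}{R} \right)^{12} - 6 \left( \frac{R_0}{R} \right)^{10} \right] F(\theta, \phi, \varphi) \]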



where R_0 (2.8 Å) and D_0 (8 kcal/mol) are the hydrogen-bond equilibrium distance and well-depth, respectively, and R
is the donor to acceptor distance. This hydrogen bond potential is based on the potential used in DREIDING with more
restrictive angle-dependent terms to limit the occurrence of unfavorable hydrogen bond geometries. The angle term
varies depending on the hybridization state of the donor and acceptor, as shown in Equations 10, 11, 12 and 13.
Equation 10 is used for sp3 donor to sp3 acceptor, Equation 11 is used for sp3 donor to sp2 acceptor, Equation 12 is
used for sp2 donor to sp3 acceptor, and Equation 13 is used for sp2 donor to sp2 acceptor:

Equation 10

\[ F = \cos^2\theta \, \cos^2(\phi - 109.5^\circ) \]

Equation 11

\[ F = \cos^2\theta \, \cos^2\phi \]

Equation 12

\[ F = \cos^4\theta \]

Equation 13

\[ F = \cos^2\theta \, \cos^2(\max[\phi, \varphi]) \]



[0087] In Equations 10-13, θ is the donor-hydrogen-acceptor angle, φ is the hydrogen-acceptor-base angle (the base
is the atom attached to the acceptor, for example the carbonyl carbon is the base for a carbonyl oxygen acceptor), and
ϕ is the angle between the normals of the planes defined by the six atoms attached to the sp2 centers (the supplement
of ϕ is used when ϕ is less than 90°). The hydrogen-bond function is only evaluated when 2.6 Å ≤ R ≤ 3.2 Å, θ > 90°,
φ - 109.5° < 90° for the sp3 donor - sp3 acceptor case, and φ > 90° for the sp3 donor - sp2 acceptor case; preferably,
no switching functions are used. Template donors and acceptors that are involved in template-template hydrogen bonds
are preferably not included in the donor and acceptor lists. For the purpose of exclusion, a template-template hydrogen
bond is considered to exist when 2.5 Å ≤ R ≤ 3.3 Å and θ ≥ 135°.
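
The hydrogen-bond term can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the specification: it assumes a 12-10 distance term D_0[5(R_0/R)^12 - 6(R_0/R)^10] multiplied by the hybridization-dependent angle term, with squared cosines (a fourth power for the sp2 donor to sp3 acceptor case). The function name `hbond_energy` and its argument names are hypothetical; angles are in degrees.

```python
import math

R0 = 2.8  # hydrogen-bond equilibrium distance, angstroms (from the text)
D0 = 8.0  # hydrogen-bond well depth, kcal/mol (from the text)


def hbond_energy(r, donor, acceptor, theta, phi=None, plane_angle=None):
    """Hydrogen-bond energy in kcal/mol, or 0.0 outside the evaluation window.

    theta       : donor-hydrogen-acceptor angle
    phi         : hydrogen-acceptor-base angle
    plane_angle : angle between the normals of the two sp2 planes
    """
    if not (2.6 <= r <= 3.2) or theta <= 90.0:
        return 0.0
    cos2_theta = math.cos(math.radians(theta)) ** 2
    if donor == "sp3" and acceptor == "sp3":
        if phi - 109.5 >= 90.0:
            return 0.0
        angle_term = cos2_theta * math.cos(math.radians(phi - 109.5)) ** 2
    elif donor == "sp3" and acceptor == "sp2":
        if phi <= 90.0:
            return 0.0
        angle_term = cos2_theta * math.cos(math.radians(phi)) ** 2
    elif donor == "sp2" and acceptor == "sp3":
        angle_term = cos2_theta ** 2
    elif donor == "sp2" and acceptor == "sp2":
        if plane_angle < 90.0:
            plane_angle = 180.0 - plane_angle  # use the supplement
        angle_term = cos2_theta * math.cos(math.radians(max(phi, plane_angle))) ** 2
    else:
        raise ValueError("hybridization must be 'sp2' or 'sp3'")
    ratio = R0 / r
    return D0 * (5.0 * ratio ** 12 - 6.0 * ratio ** 10) * angle_term
```

At the equilibrium distance with ideal angles the distance term equals -D_0, so a perfectly formed bond scores about -8 kcal/mol and any deviation in distance or angle makes the score less favorable.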

[0088] The hydrogen-bond potential may also be combined or used with a weak coulombic term that includes a
distance-dependent dielectric constant of 40R, where R is the interatomic distance. Partial atomic charges are
preferably only applied to polar functional groups. A net formal charge of +1 is used for Arg and Lys and a net formal
charge of -1 is used for Asp and Glu; see Gasteiger, et al., Tetrahedron 36:3219-3288 (1980); Rappe, et al., J. Phys.
Chem. 95:3358-3363 (1991).
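
A coulombic term with a distance-dependent dielectric of 40R can be sketched as below. The conversion constant 332.0637 kcal·Å/(mol·e²) is the usual factor for charges in units of the electron charge and distances in ångströms; it is an assumption here, since the text does not state units. The name `coulomb_energy` is hypothetical.

```python
COULOMB_CONSTANT = 332.0637  # kcal*angstrom/(mol*e^2), assumed unit convention


def coulomb_energy(q1, q2, r, dielectric_slope=40.0):
    """Coulomb energy in kcal/mol with dielectric constant = dielectric_slope * r."""
    if r <= 0:
        raise ValueError("interatomic distance must be positive")
    dielectric = dielectric_slope * r
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)
```

Because the dielectric grows with R, the energy falls off as 1/R² rather than 1/R, which keeps the term weak at long range: two opposite unit charges 4 Å apart score only about -0.52 kcal/mol.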

[0089] In a preferred embodiment, an explicit penalty is given for buried polar hydrogen atoms which are not hydrogen
bonded to another atom. See Eisenberg, et al., (1986) (supra), hereby expressly incorporated by reference. In a
preferred embodiment, this penalty for polar hydrogen burial is from about 0 to about 3 kcal/mol, with from about 1 to
about 3 being preferred and 2 kcal/mol being particularly preferred. This penalty is only applied to buried polar
hydrogens not involved in hydrogen bonds. A hydrogen bond is considered to exist when E_hb ranges from about 1 to
about 4 kcal/mol, with E_hb less than -2 kcal/mol being preferred. In addition, in a preferred embodiment, the penalty
is not applied to template hydrogens, i.e. unpaired buried hydrogens of the backbone.






take on a secondary stottlure. either a-helix or fr*h«,!T^ . ^^'^^ ^ certain prooenlh, to 

jDJiotech, 6:382 (1995); Minor, et a/.. Nature 365:6X^,^47 P ! ^"^'^^ ^"^^ aA.'SnX 

Munoz. era/., Foldng&Design Kajrli/Im (SVa^rl!!^'-.!!^'"^'^'^"' Nature 344:2685?^7S 
are expressly ii^^rporaled herein by ref^e '^^^'^^'y. ^A. Protein ScTli43 (1994? aH 

^m.^«inthebacR5one.asecondr;2Se^iS-Z^^^^ 

^ H the art, generally on the basis of A and w anate<=- w „ r'*,.'^'**^ » <telemirned as will be appredafed b« 



Equation 14



[0095] In Equation 14 E (nr pr\ »u 






Equation 15 






template and all other rotamers, is done. However, as outlined above, it is possible to only model a portion of a protein,
for example a domain of a larger protein, and thus in some cases, not all of the protein need be considered.
[0102] In a preferred embodiment, the first step of the computational processing is done by calculating two sets of
interactions for each rotamer at every position (step 70 of Figure 3): the interaction of the rotamer side chain with the
template or backbone (the "singles" energy), and the interaction of the rotamer side chain with all other possible
rotamers at every other position (the "doubles" energy), whether that position is varied or floated. It should be
understood that the backbone in this case includes both the atoms of the protein structure backbone, as well as the
atoms of any fixed residues, wherein the fixed residues are defined as a particular conformation of an amino acid.
[0103] Thus, "singles" (rotamer/template) energies are calculated for the interaction of every possible rotamer at
every variable residue position with the backbone, using some or all of the scoring functions. Thus, for the hydrogen
bonding scoring function, every hydrogen bonding atom of the rotamer and every hydrogen bonding atom of the
backbone is evaluated, and the E_hb is calculated for each possible rotamer at every variable position. Similarly, for
the van der Waals scoring function, every atom of the rotamer is compared to every atom of the template (generally
excluding the backbone atoms of its own residue), and the E_vdw is calculated for each possible rotamer at every
variable residue position. In addition, generally no van der Waals energy is calculated if the atoms are connected by
three bonds or less. For the atomic solvation scoring function, the surface of the rotamer is measured against the
surface of the template, and the E_as for each possible rotamer at every variable residue position is calculated. The
secondary structure propensity scoring function is also considered as a singles energy, and thus the total singles
energy may contain an E_ss term. As will be appreciated by those in the art, many of these energy terms will be close
to zero, depending on the physical distance between the rotamer and the template position; that is, the farther apart
the two moieties, the lower the energy.

[0104] Accordingly, as outlined above, the total singles energy is the sum of the energy of each scoring function used
at a particular position, as shown in Equation 1, wherein n is either 1 or zero, depending on whether that particular
scoring function was used at the rotamer position:

Equation 1

\[ E_{total} = nE_{vdw} + nE_{as} + nE_{h\text{-}bonding} + nE_{ss} + nE_{elec} \]

[0105] Once calculated, each singles E_total for each possible rotamer is stored in the memory 24 within the computer,
such that it may be used in subsequent calculations, as outlined below.

[0106] For the calculation of "doubles" energy (rotamer/rotamer), the interaction energy of each possible rotamer is
compared with every possible rotamer at all other variable residue positions. Thus, "doubles" energies are calculated
for the interaction of every possible rotamer at every variable residue position with every possible rotamer at every
other variable residue position, using some or all of the scoring functions. Thus, for the hydrogen bonding scoring
function, every hydrogen bonding atom of the first rotamer and every hydrogen bonding atom of every possible second
rotamer is evaluated, and the E_hb is calculated for each possible rotamer pair for any two variable positions. Similarly,
for the van der Waals scoring function, every atom of the first rotamer is compared to every atom of every possible
second rotamer, and the E_vdw is calculated for each possible rotamer pair at every two variable residue positions. For
the atomic solvation scoring function, the surface of the first rotamer is measured against the surface of every possible
second rotamer, and the E_as for each possible rotamer pair at every two variable residue positions is calculated. The
secondary structure propensity scoring function need not be run as a "doubles" energy, as it is considered as a
component of the "singles" energy. As will be appreciated by those in the art, many of these double energy terms will be
close to zero, depending on the physical distance between the first rotamer and the second rotamer; that is, the farther
apart the two moieties, the lower the energy.

[0107] Accordingly, as outlined above, the total doubles energy is the sum of the energy of each scoring function
used to evaluate every possible pair of rotamers, as shown in Equation 16, wherein n is either 1 or zero, depending
on whether that particular scoring function was used at the rotamer position:

Equation 16

\[ E_{total} = nE_{vdw} + nE_{as} + nE_{h\text{-}bonding} + nE_{elec} \]
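
The singles and doubles bookkeeping can be sketched as two lookup tables. This is a minimal illustration, not the stored format used in memory 24: `build_energy_tables`, `singles_fn` and `doubles_fn` are hypothetical names, and the two callbacks stand in for whatever combination of scoring functions is switched on.

```python
from itertools import combinations, product


def build_energy_tables(rotamers, singles_fn, doubles_fn):
    """Precompute rotamer/template and rotamer/rotamer energies.

    rotamers   : dict mapping each variable position to its list of rotamers
    singles_fn : callable(position, rotamer) -> energy with the template
    doubles_fn : callable(pos_i, rot_i, pos_j, rot_j) -> pair energy
    """
    singles = {
        (pos, rot): singles_fn(pos, rot)
        for pos, rots in rotamers.items()
        for rot in rots
    }
    doubles = {}
    # Each unordered pair of positions is visited once.
    for pos_i, pos_j in combinations(sorted(rotamers), 2):
        for rot_i, rot_j in product(rotamers[pos_i], rotamers[pos_j]):
            doubles[(pos_i, rot_i, pos_j, rot_j)] = doubles_fn(pos_i, rot_i, pos_j, rot_j)
    return singles, doubles
```

For two positions with three rotamers each, the singles table holds six entries and the doubles table nine.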

[0108] An example is illuminating. A first variable position, i, has three (an unrealistically low number) possible
rotamers (which may be either from a single amino acid or different amino acids) which are labelled ia, ib, and ic. A
second variable position, j, also has three possible rotamers, labelled jd, je, and jf. Thus, nine doubles energies (E_total)




S'Siirra'^"'"' 

iwnoj Once the singles and doubles eneraiP«;ar«rw,!^.j^*«^ ^ . 
ing rr«y oca.. Generally speaWng. 3^:^^^^ 

sequences. By "optimized protein sequence" hereh te^^ Processing s to determine a set of optimized protein 
herein. As wiB be appreciated by th<L r^^e a S IbS"'^ "'^ mathen«tical equations 

Equaticn 1. i.e. the sequence mat t,as ,l,e T^^Z^:;' L'* '^^^^ '^atbest tits 

J sequences that are no. the global minimum but that Le fow Tr^L ""^'^^ 

t.ro2ii^s;iterrsi^^^^^^ 

r.T? °" ^ '^"'^ ^'o,\T£!:^^Z^^ '"^^ ^'""'-«°" program 

Lcipi^£~trir^^^^^^ 

versence. piovidingasef of sequences of which the 010^11. ^ corrt»na„ons but be stopped prior to con- 
fer example using a differen, method, may bet n on' 2,^17,:^^^^^ T'""' '"'^ «»"P^tional analysis, 
Altemalive.y, as is more fully described below theTbal olfimif ™. ^^"""^^^^^ rank them differenOy 
processing may occur, which generates addidon^oStd sZn^s 1^^^ computaUona'l 
0113J If a set comprising more than one optimiz^ pmtSi^u^ ^^"^ ^ <^^»^- 
emns of theoretical quantitaUve stability, as isLre MIylscnSlow 

eferridtoT^rXTS^^^^ 

the elimination of a« rotamers with template int^c^^rZsoZ^ZT ^ 

putahon. with elimination energies of greater than S 15 Hr^lf^ ^"^^ ^"^^^^ f"^ ^ Wcom- 

mol being especially prefer^!. Similarly. douWes^S^ln ^ ^ "'^^^'^ ^""^ ^^^^l^^ "han about 25 kcal/ 

than about 10 Kcal/mol «^th all rotame^ a, a sSSS;:e pcS^n'S ! 

eaed and greater than about 25 kcal/mol being espS^^e.! ' ^^'^ "^"g Pr- 

LrgL.';L::?LrcoCSt::SCS^ ^-nmna^on of total sequence 

o.er possib. sequences. . de^red. The enl3"a^-:Sr^^^^^^^ 



Equation 17 






«keptconstant).thesinglesenergy for each rotamlri^^^^ 

energy for each rotamer pair (which has already beercScula^^n?c^ ^ ^ '^e doubles 
poss bie rofamer sequence can then be ranked eiSr^bStv^^^^^^^^ f '"^^ ^ each 

lor^y expensive and becomes unwieldy as the len^TZ^SZ^e^ ^ """"^'^ '^^'"^^ 

^ '«^^3^1^^0(1994Pr^f^^ Ch. 10:1-49(1^4): Goldstein. BiS' 

a rotamer can be eliminated from consWerS^ a paS^,!^^^^^ «he observatlS^^ 

rotamer is definitely not part of the gtobal opb^? conCm on T ' " 'determination that a partfcular 

companng the worst interaction (i.e. energyTr Z lZlul^ T °' '^^''^ " ™« done by 

act»n of a second rotamer at the same ^aWe^sitl ^ ""^'^ "^"^"'^ f^""" the best inter- 

the best interaction of the second rotame'Z rot m^' °' ""^ '^'^'"^ "^-er Jan 

•he sequence. The original DEE theorem s show^ i^, 0^13 "'"'""^ "^^ ''P*'^' «=»'rformabon of 






Equation 18

\[ E(ia) + \sum_{j} \left[ \min \text{ over } t \, \{ E(ia, jt) \} \right] > E(ib) + \sum_{j} \left[ \max \text{ over } t \, \{ E(ib, jt) \} \right] \]

[0117] In Equation 18, rotamer ia is being compared to rotamer ib. The left side of the inequality is the best possible
interaction energy (E_total) of ia with the rest of the protein; that is, "min over t" means find the rotamer t on position j
that has the best interaction with rotamer ia. Similarly, the right side of the inequality is the worst possible (max)
interaction energy of rotamer ib with the rest of the protein. If this inequality is true, then rotamer ia is Dead-Ending and
can be Eliminated. The speed of DEE comes from the fact that the theorem only requires sums over the sequence
length to test and eliminate rotamers.
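
The original DEE test can be sketched directly from the inequality. The data layout is an assumption made for illustration: `singles` is keyed by (position, rotamer), `doubles` by a pair of such tuples in either order, and `pair_energy` and `original_dee_eliminates` are hypothetical names.

```python
def pair_energy(doubles, a, b):
    """Symmetric lookup; a and b are (position, rotamer) tuples."""
    return doubles[(a, b)] if (a, b) in doubles else doubles[(b, a)]


def original_dee_eliminates(rotamers, singles, doubles, pos, rot_a, rot_b):
    """Return True if rot_a at pos is dead-ending relative to rot_b.

    Compares the best case for rot_a (minimum over partners) against the
    worst case for rot_b (maximum over partners), summed over all other
    positions.
    """
    best_a = singles[(pos, rot_a)]
    worst_b = singles[(pos, rot_b)]
    for other, candidates in rotamers.items():
        if other == pos:
            continue
        best_a += min(pair_energy(doubles, (pos, rot_a), (other, t)) for t in candidates)
        worst_b += max(pair_energy(doubles, (pos, rot_b), (other, t)) for t in candidates)
    return best_a > worst_b
```

Each test costs one pass over the other positions and their rotamers, which is why elimination scales with sequence length rather than with the number of sequences.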

[0118] In a preferred embodiment, a variation of DEE is performed. Goldstein DEE, based on Goldstein, (1994)
(supra), hereby expressly incorporated by reference, is a variation of the DEE computation, as shown in Equation 19:

Equation 19

\[ E(ia) - E(ib) + \sum_{j} \left[ \min \text{ over } t \, \{ E(ia, jt) - E(ib, jt) \} \right] > 0 \]

[0119] In essence, the Goldstein Equation 19 says that a first rotamer a of a particular position i (rotamer ia) will not
contribute to a local energy minimum if the energy of conformation with ia can always be lowered by just changing the
rotamer at that position to ib, keeping the other residues equal. If this inequality is true, then rotamer ia is Dead-Ending
and can be Eliminated.
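
The Goldstein criterion differs from the original test only in taking the minimum of the energy difference for each partner position. A minimal sketch follows; the dictionary layout and the names `pair_energy` and `goldstein_eliminates` are assumptions made for illustration.

```python
def pair_energy(doubles, a, b):
    """Symmetric lookup; a and b are (position, rotamer) tuples."""
    return doubles[(a, b)] if (a, b) in doubles else doubles[(b, a)]


def goldstein_eliminates(rotamers, singles, doubles, pos, rot_a, rot_b):
    """Return True if switching rot_a to rot_b always lowers the energy."""
    total = singles[(pos, rot_a)] - singles[(pos, rot_b)]
    for other, candidates in rotamers.items():
        if other == pos:
            continue
        # Smallest advantage of rot_b over rot_a across partners at this position.
        total += min(
            pair_energy(doubles, (pos, rot_a), (other, t))
            - pair_energy(doubles, (pos, rot_b), (other, t))
            for t in candidates
        )
    return total > 0
```

Because the same partner rotamer t is used on both sides of the difference, this test can eliminate rotamers that the original criterion cannot: a rotamer whose best case overlaps the rival's worst case may still be worse than the rival against every individual partner.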

[0120] Thus, in a preferred embodiment, a first DEE computation is done where rotamers at a single variable position
are compared ("singles" DEE) to eliminate rotamers at a single position. This analysis is repeated for every variable
position, to eliminate as many single rotamers as possible. In addition, every time a rotamer is eliminated from
consideration through DEE, the minimum and maximum calculations of Equation 18 or 19 change, depending on which
DEE variation is used, thus conceivably allowing the elimination of further rotamers. Accordingly, the singles DEE
computation can be repeated until no more rotamers can be eliminated; that is, when the inequality is no longer true
such that all of them could conceivably be found on the global optimum.
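
The repeat-until-stable loop can be sketched independently of which inequality is used. In this sketch `run_singles_dee` is a hypothetical name, and `is_dead_ending(alive, pos, rot_a, rot_b)` is a caller-supplied test (for example an implementation of Equation 18 or 19) that must consult only the rotamers still present in `alive`, so that each elimination tightens later tests.

```python
def run_singles_dee(rotamers, is_dead_ending):
    """Repeat singles elimination until a full pass removes nothing.

    Returns the surviving rotamers per position and the number of passes.
    """
    alive = {pos: list(rots) for pos, rots in rotamers.items()}
    passes = 0
    changed = True
    while changed:
        changed = False
        passes += 1
        for pos in alive:
            for rot_a in list(alive[pos]):
                rivals = [r for r in alive[pos] if r != rot_a]
                if any(is_dead_ending(alive, pos, rot_a, rot_b) for rot_b in rivals):
                    alive[pos].remove(rot_a)
                    changed = True
    return alive, passes
```

A rotamer is only ever removed in favor of a surviving rival, so every position keeps at least one rotamer.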

[0121] In a preferred embodiment, "doubles" DEE is additionally done. In doubles DEE, pairs of rotamers are
evaluated; that is, a first rotamer at a first position and a second rotamer at a second position are compared to a third
rotamer at the first position and a fourth rotamer at the second position, either using original or Goldstein DEE. Pairs
are then flagged as nonallowable, although single rotamers cannot be eliminated, only the pair. Again, as for singles
DEE, every time a rotamer pair is flagged as nonallowable, the minimum calculations of Equation 18 or 19 change
(depending on which DEE variation is used) thus conceivably allowing the flagging of further rotamer pairs. Accordingly,
the doubles DEE computation can be repeated until no more rotamer pairs can be flagged; that is, where the energy
of rotamer pairs overlap such that all of them could conceivably be found on the global optimum.
[0122] In addition, in a preferred embodiment, rotamer pairs are initially prescreened to eliminate rotamer pairs prior
to DEE. This is done by doing relatively computationally inexpensive calculations to eliminate certain pairs up front.
This may be done in several ways, as is outlined below.

[0123] In a preferred embodiment, the rotamer pair with the lowest interaction energy with the rest of the system is
found. Inspection of the energy distributions in sample matrices has revealed that an i_u j_v pair that dead-end
eliminates a particular i_r j_s pair can also eliminate other i_r j_s pairs. In fact, there are often a few i_u j_v pairs,
which we call "magic bullets," that eliminate a significant number of i_r j_s pairs. We have found that one of the most
potent magic bullets is the pair for which the maximum interaction energy, ε_max([i_u j_v]), is least. This pair is
referred to as [i_u j_v]_mb. If this rotamer pair is used in the first round of doubles DEE, it tends to eliminate pairs
faster.

[0124] Our first speed enhancement is to evaluate the first-order doubles calculation for only the matrix elements in
the row corresponding to the [i_u j_v]_mb pair. The discovery of [i_u j_v]_mb is an n² calculation (n = the number of
rotamers per position), and the application of Equation 19 to the single row of the matrix corresponding to this rotamer
pair is another n² calculation, so the calculation time is small in comparison to a full first-order doubles calculation. In
practice, this calculation produces a large number of dead-ending pairs, often enough to proceed to the next iteration
of singles elimination without any further searching of the doubles matrix.

[0125] The magic bullet first-order calculation will also discover all dead-ending pairs that would be discovered by
Equation 18 or 19, thereby making it unnecessary. This stems from the fact that ε_max([i_u j_v]_mb) must be less than
or equal to any ε_max([i_u j_v]) that would successfully eliminate a pair by Equation 18 or 19.

[01261 Since the minima and maxima of any given pair has been precalculated as outfined herein, a second speed- 






slopped: ^ ^'^'"'^''°"-^"*-»^"' 'hat satisfy either one of me following Criteria 



Equation 20 
Equation 21: 



the matrix need not be subjected to the evaluaS e ' ato^ l2 iL T f ^"'^ "^"^^ ""^^^^ «^ 
of a factor of four. tquat.on 18 or 19. resulting in a theoretical speed enhancement 

[0128J The last DEE speed enhancement refines the searrh «f • ■ 

c.st.c«n.ametnc^.e.e.mputede.em:ro:rtr^^^^^^^^ 

CLofr^xZa'ltjS.^Slf^^^ 

Sizes (see Figure 12) for ^P^^Tc^uTj:li:;Z:T ""IT ' ^^^"^ 
and Intervals were also computed as well as the Tffo !! k ! The size of the overiap of the IJ, 

maxima. Coml^nations of these'^.S.i'es^L weTa' m?"^^^^^ """'"'"^ ^ ^ '^'"--^ bleen tl^ 

occurrence of dead^nding pairs. Because s^e 3 Z ^a^l Z^T ^""^ P^"=t 

logarithmically. ^ '"^'""'^ very large, the quantities were also compared 

[0130] Most of the combinations were abfe tn nr^wi^t j» j 



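Equation 22

\[ q_{rs} = \frac{\text{interval overlap}}{\text{interval}([i_r j_s])} = \frac{\varepsilon_{max}([i_u j_v]) - \varepsilon_{min}([i_r j_s])}{\varepsilon_{max}([i_r j_s]) - \varepsilon_{min}([i_r j_s])} \]

Equation 23

\[ q_{uv} = \frac{\text{interval overlap}}{\text{interval}([i_u j_v])} = \frac{\varepsilon_{max}([i_u j_v]) - \varepsilon_{min}([i_r j_s])}{\varepsilon_{max}([i_u j_v]) - \varepsilon_{min}([i_u j_v])} \]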


[0131] These values are calculated using the minima and maxima Equations 24, 25, 26 and 27 (see Figure 14):

Equation 24 

Equation 25 

®tan< f^\JsJ>=e(I/^jf^j)^ 2 inine([/ j j ) 






Equation 26 



e« ( t' j 1 ) 



Equation 27 



[0132] These metrics were selected because they yield ratios of the occurrence of dead-ending matrix elements to
the total occurrence of elements that are higher than any of the other metrics tested. For example, there are very few
matrix elements (about 2%) for which q_rs > 0.98, yet these elements produce 30-40% of all of the dead-ending pairs.
[0133] Accordingly, the first-order doubles criterion is applied only to those doubles for which q_rs > 0.98 and q_uv >
0.99. The sample data analyses predict that by using these two metrics, as many as half of the dead-ending elements
may be found by evaluating only two to five percent of the reduced matrix.

[0134] Generally, as is more fully described below, single and double DEE, using either or both of original DEE and
Goldstein DEE, is run until no further elimination is possible. Usually, convergence is not complete, and further
elimination must occur to achieve convergence. This is generally done using "super residue" DEE.
[0135] In a preferred embodiment, additional DEE computation is done by the creation of "super residues" or
"unification", as is generally described in Desmet, Nature 356:539-542 (1992); Desmet, et al., The Protein Folding
Problem and Tertiary Structure Prediction, Ch. 10:1-49 (1994); Goldstein, et al., supra. A super residue is a combination
of two or more variable residue positions which is then treated as a single residue position. The super residue is then
evaluated in singles DEE, and doubles DEE, with either other residue positions or super residues. The disadvantage of
super residues is that there are many more rotameric states which must be evaluated; that is, if a first variable residue
position has 5 possible rotamers, and a second variable residue position has 4 possible rotamers, there are 20 possible
super residue rotamers which must be evaluated. However, these super residues may be eliminated similar to singles,
rather than being flagged like pairs.
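
Forming a super residue amounts to replacing two positions by one whose rotamers are all pairings of the originals. A minimal sketch, with the hypothetical name `unify_positions` and with energies left out for brevity:

```python
from itertools import product


def unify_positions(rotamers, pos_i, pos_j):
    """Merge two variable positions into one super residue.

    The new position is keyed by the tuple (pos_i, pos_j) and its
    rotamers are all (rot_i, rot_j) combinations.
    """
    unified = {
        pos: list(rots)
        for pos, rots in rotamers.items()
        if pos not in (pos_i, pos_j)
    }
    unified[(pos_i, pos_j)] = list(product(rotamers[pos_i], rotamers[pos_j]))
    return unified
```

With 5 rotamers at the first position and 4 at the second, the super residue carries 20 super rotamers. In practice, pairs already flagged as nonallowable would be dropped from this list; that filter is omitted here.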

[0136] The selection of which positions to combine into super residues may be done in a variety of ways. In general,
random selection of positions for super residues results in inefficient elimination, but it can be done, although this is
not preferred. In a preferred embodiment, the first evaluation in the selection of positions for a super residue is the
number of rotamers at the position. If the position has too many rotamers, it is never unified into a super residue, as
the computation becomes too unwieldy. Thus, only positions with fewer than about 100,000 rotamers are chosen, with
less than about 50,000 being preferred and less than about 10,000 being especially preferred.
[0137] In a preferred embodiment, the evaluation of whether to form a super residue is done as follows. All possible
rotamer pairs are ranked using Equation 28, and the rotamer pair with the highest number is chosen for unification:



[0138] Equation 28 is looking for the pair of positions that has the highest fraction or percentage of flagged pairs but
the fewest number of super rotamers. That is, the pair that gives the highest value for Equation 28 is preferably chosen.
Thus, if a pair of positions has the highest number of flagged pairs but also a very large number of super rotamers
(that is, the number of rotamers at position i times the number of rotamers at position j), this pair may not be chosen
(although it could) over a lower percentage of flagged pairs but fewer super rotamers.
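
The ranking can be sketched as the fraction of flagged pairs divided by the logarithm of the number of super rotamers. The base of the logarithm is not stated, so the natural logarithm is assumed here; the choice only rescales every score by the same constant and does not change the ranking. The name `unification_score` is hypothetical.

```python
import math


def unification_score(n_flagged, n_rot_i, n_rot_j):
    """Score a candidate pair of positions for unification.

    n_flagged : number of rotamer pairs between the positions flagged
                as nonallowable by doubles DEE
    n_rot_i, n_rot_j : rotamer counts at the two positions
    """
    n_super = n_rot_i * n_rot_j
    if n_super < 2:
        raise ValueError("need at least two super rotamers (log would be zero)")
    fraction_flagged = n_flagged / n_super
    return fraction_flagged / math.log(n_super)
```

For example, a pair with 50 of 100 possible pairs flagged scores about 0.109, which beats a pair with 600 of 1000 flagged (about 0.087) even though the latter has the higher flagged fraction.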

[0139] In an alternate preferred embodiment, positions are chosen for super residues that have the highest average
energy; that is, for positions i and j, the average energy of all rotamers for i and all rotamers for j is calculated, and the
pair with the highest average energy is chosen as a super residue.

[0140] Super residues are made one at a time, preferably. After a super residue is chosen, the singles and doubles
DEE computations are repeated where the super residue is treated as if it were a regular residue. As for singles and
doubles DEE, the elimination of rotamers in the super residue DEE will alter the minimum energy calculations of DEE



Equation 28

\[ \frac{\text{fraction of flagged pairs}}{\log(\text{number of super rotamers resulting from the potential unification})} \]






miTiiT^""^ ^y^^ *^ elimination of rotamers 

a single member, the global optimum. ^ optimized protein sequences contains 

[0143] a preferred embodiment, the various DEE sleos are run .inKior«.« u. 

rnent. direct calcuJafion of sequence energyl ou3al^v.?r '=°^«'«'"P'«. one embodi- 

tively. a Monte Carte eearchlan^,^ ^ ^''^ « done on the remainder possible sequences. Alterna- 

1087 (19M).he«byincorporatedSr^^ ri;,feemb^r^^^ "L" "^'"P""^ ^' ^'^«»- 21= 

IsciiosenasaslartpoinfroneemLment^evSuer^^^ 
residues and the set of available residuera?eSSo„:^^^^^^^^ 

a random rolarner for each amino add Is chosen This s«ve^ « f *^ sequence Is generated, and 
Monte Cario search then makes a r^^r^io^'JZ^^^^:^'''^^^'''^ °' ''""'^ A 
or a rotamer of a different amino acid. Z Z^^'nTs^^Tc^teTgWE ^ ^ * 

sequence energy meets the Bollzmann criteria for acceDfeinr«»i!.r» i * "=10^1 sequence) is calculated, and if the new 
Botemann test fails, another random lunTte^i^olS^ ^ ^'^"'"^ another jump. If the 

are found, to genl,^ a sL^^STC^Te^r"""- "^^^ 

quencesarealsooptimlzedprote^seqSc^SlJ^io^f^^^ "'^^ ^"t^^- These additional se- 

so as to evaluate the differences be^^me tt^fcal^d"^*^^^^ 

embodir„ent. the set of sequencesisatliTrbSSn^nlf^ "'^ ^'^"^^ = Preferred 

being preferred, at least about 85o/hoXtfbIfni '"f" * *^ ^ ^l^* 8°% homologous 

preferred. In some cases, homology as htTas So 

similarity or identty, with identity bing pre ferr^ WemLnn^is^^i °^J^. """^^^ ""^^"^ ^-'"-"<=- 
positions in the two sequences vvhich L bJiS^mo^^^^ 

identical and those which are s,milar (functSly^S^n ™«Z '^iT* '^^'^ 
niques known in the art. such as the Best l^iSXT^aSla^ d^^^^! '"^'^"^ "^'"^ 'ech- 
387-395 (1984). or the BUSTX program (AltschU efT/ J^^Z^^^r^^ ^ ^iieLAddRes, 12: 

settings for either. The alignment may ir^^me 'r^^^^^'^"^^^^ preferabfy^^ii^Fm^feult 
sequences which contain either more or feWer amira^S! tte^a^^ '^"^ ^"9"^- ^'««°". 
centageof homdogywin bedetemiined based on mT^umt!^h ^ " ''^°od that the per- 

of amino adds. Thus, for example, ho^ of sLTn^^ IZ^^ 

number of amino acids in the shorter sequence. ^ ™" "P*"™*" be determined using the 

SentSLSS'C^ireT^^^^^^^^ °' ^ op«ona.^ p^eeds .0 step 56 

seard. module 36 a set of computer'coTma.^e^"^^^^^^ Se.^The 
be written to execute a »4onte cL search as deSed a^vl Jl^^- ««a«=h module 36 may 

are dinged tootherrotamersaOowedattheirti^l^oosiL^ Lh ?^ ""^ global solution, random positions 
from different amino adds. A new sequenL elm^r '^^^^ 

meets the Boitzmann criteria for acceptance. I. is ui^i '^^tt^r^ S^T^]::^Z'SZi:'^. 






1953, supra, hereby incorporated by reference. If the Boltzmann test fails, another random jump is attempted from the 
previous sequence. A list of the sequences and their energies is maintained during the search. After a predetermined 
number of jumps, the best scoring sequences may be output as a rank-ordered list. Preferably, at least about 1 0^ jumps 
are made, with at least about 10? jumps being prefen-ed and at least about 10^ jumps being particularly preferred. 
Preferably, at least about 100 to 1000 sequences are saved, with at least about 10,000 sequences being preferred
and at least about 100,000 to 1,000,000 sequences being especially preferred. During the search, the temperature is
preferably set to 1000 K.
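
The search loop described above can be sketched as a standard Metropolis walk. This is an illustration under stated assumptions, not the search module itself: energies are taken to be in kcal/mol, so the Boltzmann constant is 0.0019872 kcal/(mol·K), and `monte_carlo_search`, `allowed`, and `energy_fn` are hypothetical names.

```python
import math
import random

BOLTZMANN_KCAL = 0.0019872041  # kcal/(mol*K), assumed energy units


def monte_carlo_search(start, allowed, energy_fn, n_jumps,
                       temperature=1000.0, n_keep=10, rng=None):
    """Random rotamer jumps with Boltzmann acceptance.

    start     : dict position -> rotamer (e.g. the DEE solution)
    allowed   : dict position -> list of rotamers permitted there
    energy_fn : callable(dict) -> total sequence energy
    Returns up to n_keep visited sequences, rank-ordered by energy.
    """
    rng = rng or random.Random()
    current = dict(start)
    e_current = energy_fn(current)
    visited = {tuple(sorted(current.items())): e_current}
    positions = sorted(allowed)
    for _ in range(n_jumps):
        pos = rng.choice(positions)
        trial = dict(current)
        trial[pos] = rng.choice(allowed[pos])
        e_trial = energy_fn(trial)
        delta = e_trial - e_current
        accept = delta <= 0 or rng.random() < math.exp(-delta / (BOLTZMANN_KCAL * temperature))
        if accept:
            current, e_current = trial, e_trial
            visited[tuple(sorted(current.items()))] = e_current
        # On rejection the next jump starts again from the previous sequence.
    return sorted(visited.items(), key=lambda item: item[1])[:n_keep]
```

At 1000 K the product kT is about 2 kcal/mol, so uphill jumps of a few kcal/mol are accepted often enough to explore the neighborhood of the starting sequence.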

[0148] Once the Monte Carlo search is over, all of the saved sequences are quenched by changing the temperature
to 0 K, and fixing the amino acid identity at each position. Preferably, every possible rotamer jump for that particular
amino acid at every position is then tried.

[0149] The computational processing results in a set of optimized protein sequences. These optimized protein
sequences are generally, but not always, significantly different from the wild-type sequence from which the backbone
was taken. That is, each optimized protein sequence preferably comprises at least about 5-10% variant amino acids
from the starting or wild-type sequence, with at least about 15-20% changes being preferred and at least about 30%
changes being particularly preferred.

[0150] These sequences can be used in a number of ways. In a preferred embodiment, one, some or all of the
optimized protein sequences are constructed into designed proteins, as shown with step 58 of Figure 2. Thereafter, the
protein sequences can be tested, as shown with step 60 of Figure 2. Generally, this can be done in one of two ways.
[0151] In a preferred embodiment, the designed proteins are chemically synthesized as is known in the art. This is
particularly useful when the designed proteins are short, preferably less than 150 amino acids in length, with less than
100 amino acids being preferred, and less than 50 amino acids being particularly preferred, although as is known in
the art, longer proteins can be made chemically or enzymatically.

[0152] In a preferred embodiment, particularly for longer proteins or proteins for which large samples are desired,
the optimized sequence is used to create a nucleic acid such as DNA which encodes the optimized sequence and
which can then be cloned into a host cell and expressed. Thus, nucleic acids, and particularly DNA, can be made which
encode each optimized protein sequence. This is done using well known procedures. The choice of codons, suitable
expression vectors and suitable host cells will vary depending on a number of factors, and can be easily optimized as
needed.

[0153] Once made, the designed proteins are experimentally evaluated and tested for structure, function and stability,
as required. This will be done as is known in the art, and will depend in part on the original protein from which the
protein backbone structure was taken. Preferably, the designed proteins are more stable than the known protein that
was used as the starting point, although in some cases, if some constraints are placed on the methods, the designed
protein may be less stable. Thus, for example, it is possible to fix certain residues for altered biological activity and
find the most stable sequence, but it may still be less stable than the wild type protein. Stable in this context means
that the new protein retains either biological activity or conformation past the point at which the parent molecule did.
Stability includes, but is not limited to, thermal stability, i.e. an increase in the temperature at which reversible or
irreversible denaturing starts to occur; proteolytic stability, i.e. a decrease in the amount of protein which is irreversibly
cleaved in the presence of a particular protease (including autolysis); stability to alterations in pH or oxidative conditions;
chelator stability; stability to metal ions; stability to solvents such as organic solvents, surfactants, formulation
chemicals; etc.

[0154] In a preferred embodiment, the modelled proteins are at least about 5% more stable than the original protein,
with at least about 10% being preferred and at least about 20-50% being especially preferred.
[0155] The results of the testing operations may be computationally assessed, as shown with step 62 of Figure 2.
An assessment module 38 may be used in this operation. That is, computer code may be prepared to analyze the test
data with respect to any number of metrics.

[0156] At this processing juncture, if the protein is selected (the yes branch at block 64) then the protein is utilized
(step 66), as discussed below. If a protein is not selected, the accumulated information may be used to alter the ranking
module 34, and/or step 56 is repeated and more sequences are searched.

[0157] In a preferred embodiment, the experimental results are used for design feedback and design optimization.
[0158] Once made, the proteins of the invention find use in a wide variety of applications, as will be appreciated by
those in the art, ranging from industrial to pharmacological uses, depending on the protein. Thus, for example, proteins
and enzymes exhibiting increased thermal stability may be used in industrial processes that are frequently run at
elevated temperatures, for example carbohydrate processing (including saccharification and liquefaction of starch to
produce high fructose corn syrup and other sweeteners), protein processing (for example the use of proteases in laundry
detergents, food processing, feed stock processing, baking, etc.), etc. Similarly, the methods of the present invention
allow the generation of useful pharmaceutical proteins, such as analogs of known proteinaceous drugs which are more
thermostable, less proteolytically sensitive, or contain other desirable changes.

[0159] The following examples serve to more fully describe the manner of using the above-described invention, as




All references cited herein are explicitly incorporated by reference.
EXAMPLES 

Example 1 

Protein Design Using van der Waals and Atomic Solvation Scoring Functions with DEE 

of four components: a design paradigm a JZleT^L^ automaton (PDA) cyde is comprised 

parad«misbasedonmeconc5afrvT.:fX7arNaL^^^^^ "^^^ ^ "^^^ 

(1991)) and consists of the use of a fixed backZl nnTo^!^ ^ ^ Bowie, ef a/.. Science 253; 1 64. 17n 

where rotamers are the allo^lnfoTaJons^f am^nl ^ ^^"^^ °' ''"^'^ rol^i^Si^n be placed. 

tertiary Inte^cfionstesedc^rS^lTme^^^^^^^^^ {«^«))- 

wUI potentially best adopt the taroet fold rXpnTh^ri "^^ *° determine the sequences that 

residue posifcn as S. fte siSfii^ m^^" ? ^"'^ ^"'^^^ «"o^ for ead, 

function expDdti ^sSLT atom Zl^^JT " ' """'"^ °' ^""""""^ based on a 

bonecompnJof/r^uS^'S"^^^^^^ 

m"posslblearrangements of the svsfem an imnZT^T t . ^ rotamers of all allowed amino adds) results in 
50 rotamersat Is'poJ^^'rStrv^"^^^^^^ 

(far beyond cuirent capabilities) vvould te^eTo^^a^ "^rr ^- . " sequences per second 

andcha^cterizationofTsubsetni^ddXTci^p^:^^^^^ 
datafortheanalysismodule.TheanalvslssediordiIoo^»™ l^^u^^^^ 
stmctures and the experimental oS^T?^3rr 

simulauon and in some ca^s to me^SeIJn';^Lm S'" " h "T''' '"'""''^ modifications to the 
module describes a theoretical potenh^Zrav Ir^T^^^ ^ '^'T'^'- ^ "^^'^ ^^e simulation 

problem at hand. This Po JSl^S^ Sis rSTu^^^^'Zal^^ "TT" ^" '° 
is detem,ined from the experimental data In this c^SJ^j^ f^^ ^ P"'"""^' ^"^^^ ^ 

cost function In order to c^ate better mZZi^^n^ ^ analysis becomes the correction of the simulafion 
corrections can be found, then the TuJ^ srs^rsi^tl^T^'^^K '"^^ " 

the target properties. Thi^ design Sfei^^SiS^l^ """"""'^^ '^''^^ ^'^hieve 

human component, allows a i?Jnt."^^pSto S^^'L^^^^^ T'' """^'"^ '^''j-*'- 

[0161J The PDA side-chain selerKnn :,ln™i.h™ «! • • ^ ' '• ' P^"'^" automation. 

taskofdesignlngasequtTef^M^rtS^^^^ 

chains relative to the given backbone. It Is not sJ^^^to^^^^^^r^"^^"' "'^"^'"^ ^'"^^ 
sequences. In order to correctly account for the 9e<Zmc^^^^^^ 1 ^ ^^^'"««"9 
tions of each side chain must also be examined Sto^H^if!^"^^.!^^^^'" P'^-^"^"*- all possible conforma- 
supra) have defined a discrete set of anow^^nfSiaS^rj^ "T" '^^'"'^^^ <''°"''-- 

a rotomer libraiy based on the Ponder and SZdS^ ?r T"^' ^^^i"- ^e use 

10162] UsingarotamerdescriSTle^MSaZbnlf'^^'^ 

all possible sequences of rotame,^ Z^re ^ b^Z^T'^'^ ^ ^ "^''^ ^^^^ 
possible rotameric states The d^^rnire rS^r '^"^ ^"'^ ^ in all ite 

sequences to be tested. A backbone S Mn^^mZ^l T ' "'^'^ °^ ""^'^^ '^iam^ 

sequences. The size of the sear^hTp^ce grows ex»a f ^r^'' "^^^ -^^"^ 
m render Intractabte an exhaustive seard," TOs ^h,^ sequence length whid, for typical values of n and 

the simulation phase of PDA "»no.nau,nai "exploSwn- is the primary obsfade to be overcome in 

lem. The DEE theorem is the basis'fl a ZS'^^T^^^S^ f"" ^««=h prot. 

chains on a fixed badione with a known seouen^ sT!!. "^""^ pad< protein side 

is used to score rotamer arrangements M^tt,!! f "^"^"^'^ "^""^^ «««"fetlc forcefield 

timum pacWngis found. T^of^m^^^l^^^Z'd!!!^ 1 ^*^9loba,op. 

constraint that a position Is limit^ to thetteStSni e^^^^ -"verse folding design paradigm by releasing Z 
number of rotamers at ead, position and Suims a ^am^Z ^^r T T"*""" ^'^"'^ '"^^ ^ 
aescribedm^.„yh.ein.Th%ua.n.eetJryLlSX^^^^^^^ 






means that the globally optimal sequence is found in its optimal conformation.

[0164] DEE was implemented with a novel addition to the improvements suggested by Goldstein (Goldstein, (1994)
(supra)). As has been noted, exhaustive application of the R=1 rotamer elimination and R=0 rotamer-pair flagging
equations and limited application of the R=1 rotamer-pair flagging equation routinely fails to find the global solution.
This problem can be overcome by unifying residues into "super residues" (Desmet, et al., (1992) (supra); Desmet, et
al., (1994) (supra); Goldstein, (1994) (supra)). However, unification can cause an unmanageable increase in the number
of super rotamers per super residue position and can lead to intractably slow performance since the computation time
for applying the R=1 rotamer-pair flagging equation increases as the fourth power of the number of rotamers. These
problems are of particular importance for protein design applications given the requirement for large numbers of
rotamers per residue position. In order to limit memory size and to increase performance, we developed a heuristic that
governs which residues (or super residues) get unified and the number of rotamer (or super rotamer) pairs that are
included in the R=1 rotamer-pair flagging equation. A program called PDA_DEE was written that takes a list of rotamer
energies from PDA_SETUP and outputs the global minimum sequence in its optimal conformation with its energy.
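For orientation only, a single-rotamer elimination pass in the spirit of the Goldstein criterion can be sketched as follows. This is not the PDA_DEE program: rotamer-pair flagging, residue unification and the heuristic described above are omitted, and the data layout, function names and randomly generated integer energies are hypothetical.

```python
import itertools
import random


def pair_energy(pair, i, r, j, s):
    """Look up the rotamer/rotamer energy regardless of index order."""
    return pair[(i, j)][r][s] if i < j else pair[(j, i)][s][r]


def goldstein_singles(single, pair):
    """Repeatedly drop rotamers that another rotamer at the same position dominates."""
    n = len(single)
    alive = [set(range(len(energies))) for energies in single]

    def dominated(i, r, t):
        # r is dead-ending if switching to t lowers the energy even in the
        # environment most favourable to r.
        gap = single[i][r] - single[i][t]
        for j in range(n):
            if j != i:
                gap += min(
                    pair_energy(pair, i, r, j, s) - pair_energy(pair, i, t, j, s)
                    for s in alive[j]
                )
        return gap > 0

    changed = True
    while changed:
        changed = False
        for i in range(n):
            for r in sorted(alive[i]):
                if any(dominated(i, r, t) for t in sorted(alive[i]) if t != r):
                    alive[i].discard(r)
                    changed = True
    return alive


# Hypothetical integer energies for 3 positions with 3, 4 and 3 rotamers.
_rng = random.Random(1)
SIZES = [3, 4, 3]
SINGLE = [[_rng.randint(-5, 5) for _ in range(k)] for k in SIZES]
PAIR = {
    (i, j): [[_rng.randint(-5, 5) for _ in range(SIZES[j])] for _ in range(SIZES[i])]
    for i in range(3) for j in range(i + 1, 3)
}


def total(assign):
    """Total energy of one rotamer assignment."""
    energy = sum(SINGLE[i][r] for i, r in enumerate(assign))
    for i, j in itertools.combinations(range(len(assign)), 2):
        energy += pair_energy(PAIR, i, assign[i], j, assign[j])
    return energy


ALIVE = goldstein_singles(SINGLE, PAIR)
BEST = min(itertools.product(*(range(k) for k in SIZES)), key=total)
```

The brute-force minimum `BEST` is computed only to check the defining property of dead-end elimination on this small example: no rotamer belonging to the global minimum is ever removed.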
[0165] Scoring functions: The rotamer library used was similar to that used by Desmet and coworkers (Desmet,
et al., (1992) (supra)). χ1 and χ2 angle values of rotamers for all amino acids except Met, Arg and Lys were expanded
plus and minus one standard deviation about the mean value from the Ponder and Richards library (supra) in order to
minimize possible errors that might arise from the discreteness of the library. χ3 and χ4 angles that were undetermined
from the database statistics were assigned values of 0° and 180° for Gln and 60°, -60° and 180° for Met, Lys and Arg.
The number of rotamers per amino acid is: Gly, 1; Ala, 1; Val, 9; Ser, 9; Cys, 9; Thr, 9; Leu, 36; Ile, 45; Phe, 36; Tyr,
36; Trp, 54; His, 54; Asp, 27; Asn, 54; Glu, 69; Gln, 90; Met, 21; Lys, 57; Arg, 55. The cyclic amino acid Pro was not
included in the library. Further, all rotamers in the library contained explicit hydrogen atoms. Rotamers were built with
bond lengths and angles from the Dreiding forcefield (Mayo, et al., J. Phys. Chem. 94:8897 (1990)).
[0166] The initial scoring function for sequence arrangements used in the search was an atomic van der Waals
potential. The van der Waals potential reflects excluded volume and steric packing interactions which are important
determinants of the specific three dimensional arrangement of protein side chains. A Lennard-Jones 12-6 potential
with radii and well depth parameters from the Dreiding forcefield was used for van der Waals interactions. Non-bonded
interactions for atoms connected by one or two bonds were not considered. van der Waals radii for atoms connected
by three bonds were scaled by 0.5. Rotamer/rotamer pair energies and rotamer/template energies were calculated in
a manner consistent with the published DEE algorithm (Desmet, et al., (1992) (supra)). The template consisted of the
protein backbone and the side chains of residue positions not to be optimized. No intra-side-chain potentials were
calculated. This scheme scored the packing geometry and eliminated bias from rotamer internal energies. Prior to
DEE, all rotamers with template interaction energies greater than 25 kcal/mol were eliminated. Also, any rotamer whose
interaction was greater than 25 kcal/mol with all other rotamers at another residue position was eliminated. A program
called PDA_SETUP was written that takes as input backbone coordinates, including side chains for positions not
optimized, a rotamer library, a list of positions to be optimized and a list of the amino acids to be considered at each
position. PDA_SETUP outputs a list of rotamer/template and rotamer/rotamer energies.
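The following sketch illustrates, under stated assumptions, the 12-6 energy term and the two 25 kcal/mol pre-DEE filters described above. It is not PDA_SETUP: the functional form D0[(R0/r)^12 - 2(R0/r)^6] is the usual minimum-distance form of a Lennard-Jones potential, the numerical parameters are round illustrative values rather than Dreiding parameters, and all function names and tables are hypothetical.

```python
CUTOFF = 25.0  # kcal/mol, the elimination threshold quoted in the text


def lennard_jones_12_6(r, r0, d0, radius_scale=1.0):
    """12-6 energy with its minimum of -d0 at separation radius_scale * r0.

    radius_scale=0.5 mimics the scaling applied to atoms three bonds apart.
    """
    x = (radius_scale * r0) / r
    return d0 * (x ** 12 - 2.0 * x ** 6)


def lookup(pair, i, r, j, s):
    """Rotamer/rotamer energy regardless of index order."""
    return pair[(i, j)][r][s] if i < j else pair[(j, i)][s][r]


def prefilter(template, pair, cutoff=CUTOFF):
    """Return, per position, the rotamer indices that survive both filters."""
    n = len(template)
    kept = []
    for i in range(n):
        survivors = []
        for r, e_template in enumerate(template[i]):
            if e_template > cutoff:
                continue  # clashes with the fixed template
            clashes_everywhere = any(
                all(lookup(pair, i, r, j, s) > cutoff
                    for s in range(len(template[j])))
                for j in range(n) if j != i
            )
            if not clashes_everywhere:
                survivors.append(r)
        kept.append(survivors)
    return kept


# Hypothetical tables: position 0 has 3 rotamers, position 1 has 2.
TEMPLATE = [[1.0, 30.0, 2.0], [0.5, 0.1]]
PAIR = {(0, 1): [[0.0, 1.0], [0.0, 0.0], [40.0, 50.0]]}
KEPT = prefilter(TEMPLATE, PAIR)
```

In this toy case rotamer 1 at position 0 is dropped for its template energy, and rotamer 2 at position 0 is dropped because it clashes with every rotamer at position 1.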

[0167] The pairwise solvation potential was implemented in two components to remain consistent with the DEE
methodology: rotamer/template and rotamer/rotamer burial. For the rotamer/template buried area, the reference state
was defined as the rotamer in question at residue i with the backbone atoms only of residues i-1, i and i+1. The area
of the side chain was calculated with the backbone atoms excluding solvent but not counted in the area. The folded
state was defined as the area of the rotamer in question at residue i, but now in the context of the entire template
structure including non-optimized side chains. The rotamer/template buried area is the difference between the
reference and the folded states. The rotamer/rotamer reference area is simply the sum of the areas of the isolated rotamers.
The folded state is the area of the two rotamers placed in their relative positions on the protein scaffold but with no
template atoms present. The Richards definition of solvent accessible surface area (Lee & Richards, 1971, supra) was
used, with a probe radius of 1.4 Å and Dreiding van der Waals radii. Carbon and sulfur, and all attached hydrogens,
were considered nonpolar. Nitrogen and oxygen, and all attached hydrogens, were considered polar. Surface areas
were calculated with the Connolly algorithm using a dot density of 10 Å^-2 (Connolly, (1983) (supra)). In more recent
implementations of PDA_SETUP, the MSEED algorithm of Scheraga has been used in conjunction with the Connolly
algorithm to speed up the calculation (Perrot, et al., J. Comput. Chem. 13:1-11 (1992)).

[0168] Monte Carlo search: Following DEE optimization, a rank ordered list of sequences was generated by a Monte
Carlo search in the neighborhood of the DEE solution. This list of sequences was necessary because of possible
differences between the theoretical and actual potential surfaces. The Monte Carlo search starts at the global minimum
sequence found by DEE. A residue was picked randomly and changed to a random rotamer selected from those allowed
at that site. A new sequence energy was calculated and, if it met the Boltzmann criteria for acceptance, the new sequence
was used as the starting point for another jump. If the Boltzmann test failed, then another random jump was attempted
from the previous sequence. A list of the best sequences found and their energies was maintained throughout the
search. Typically 10^6 jumps were made, 100 sequences saved and the temperature was set to 1000 K. After the search






PnA"fwinMTc , sequences rank ordered by their score PDA SETUP PDA npp 

and dimeric tertiary organization ease SSSn '^'^'^"'q^es and their helical secondary structure 

^.ed a heptad r^ea.^ 'a-... J.^S^^;;:Src%l^^^^^^^ 

tallographically determined fixeTfiS^5Si^2| J^^S^ ' "'"'""^^'^ 

rotamers from hydrophobic amino acids (A^r M ^,^^^^°"'°^J^^^o^ symmetry was enforced, only 

Asn 16. was not optimized ^^^■'■'''**''^^^^^^~'«««^««*«"<l«f««Paragi^ 

left in .heir c^s.al,ogr;phicallyde,l^^^^^^ 

Diego, CA) was used to generate exDlidthvdr^rr^ hr^S ^P^"^ (Biosym/Molecular Simulations. San 
50 steps using the Dremg S^^ i^^^H^^^^ 

.he rotamer groups for meVn^z^^a^r^ d p^^^'^e allowing hydrophobic amino acids into 

Phe. Tyr and Trp for a total of 238 rotamera dT^S^ M """^^ted of Ala. Val. Leu. lie. Met. 

kca^mol rotam^ pairs that violate eS^^m'^rD^:n7o^^^^^ f^^"^'"^ 
symmetry related positions The asr,ar^mn^ • T rotamers of the same amino add were allowed at 

optimized. A 106 sterM^teirtoS^^^ 

rank ordered by their score Tctel" fe^S^dWlt Jlt.'^L ^'^^^ of candidate sequences 

see<fe and all tnals prow J^e^S^^B^^^^^^ 

SrXr'lh: l\l aTgllrn^thr b? ? ^^^^^ -^.s m 2381s or 1033 

time. The^E solution mSr^^^^ 

SeS ;";r^r — 

quences with a range 01 staW(iti« w«e Ji!^»?fr ^ ^"^"^"^ '"^ ^ coil. Eight se- 

about 15 ^c^frnol'S^.rTJZZ S^:^ S^'Se "t "^"''"^ ""^ ^ = ^"^ 






TABLE 1

Name     Sequence                            Rank   Energy
PDA-3H   RMKQLEDKVEELLSKNYHLENEVARLKKLVGER      1   -118.1
PDA-3A   RMKQLEDKVEELLSKNYHLENEVARLKKUGER       2   -115.3
PDA-3G   RMKQLEDKVEELLSKNYHLENEMARLKKLVGER      5   -112.8
PDA-3B   RLKQMEDKVEELLSKNYHLENEVARLKKLVGER      6   -112.6
PDA-3D   RLKQMEDKVEELLSKNYHLENEVARLKKLAGER     13   -109.7
PDA-3C   RMKQWEDKAEELLSKNYHLENEVARLKKLVQER     14   -109.6
PDA-3F   RMKQFEDKVEELLSKNYHLENEVARLKKLVGER     56   -103.9
PDA-3E   RMKQLEDKVEELLSKNYHAENEVARLKKLVGER     70   -103.1

[0174] Thirty-three residue peptides were synthesized on an Applied Biosystems Model 433A peptide synthesizer
using Fmoc chemistry, HBTU activation and a modified Rink amide resin from Novabiochem. Standard 0.1 mmol coupling
cycles were used and amino termini were acetylated. Peptides were cleaved from the resin by treating approximately
200 mg of resin with 2 mL trifluoroacetic acid (TFA) and 100 µL water, 100 µL thioanisole, 50 µL ethanedithiol
and 150 mg phenol as scavengers. The peptides were isolated and purified by precipitation and repeated washing
with cold methyl tert-butyl ether followed by reverse phase HPLC on a Vydac C8 column (25 cm by 22 mm) with a
linear acetonitrile-water gradient containing 0.1% TFA. Peptides were then lyophilized and stored at -20 °C until use.
Plasma desorption mass spectrometry found all molecular weights to be within one unit of the expected masses.
[0175] Circular dichroism: CD spectra were measured on an Aviv 62DS spectrometer at pH 7.0 in 50 mM phosphate,
150 mM NaCl and 40 µM peptide. A 1 mm pathlength cell was used and the temperature was controlled by a
thermoelectric unit. Thermal melts were performed in the same buffer using two degree temperature increments with an
averaging time of 10 s and an equilibration time of 90 s. Tm values were derived from the ellipticity at 222 nm ([θ]222)
by evaluating the minimum of the d[θ]222/dT^-1 versus T plot (Cantor & Schimmel, Biophysical Chemistry. New York: W.
H. Freeman and Company, 1980). The Tm's were reproducible to within one degree. Peptide concentrations were
determined from the tyrosine absorbance at 275 nm (Huyghues-Despointes, et al., supra).
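By way of illustration, locating Tm as the minimum of d[θ]222/d(1/T) can be sketched with finite differences. The melting curve below is synthetic (a two-state sigmoid with hypothetical baselines and width), not measured data, and the function names are assumptions of this sketch; only the two degree temperature increment is taken from the text.

```python
import math


def melting_temperature(temps_c, theta):
    """Return the temperature (deg C) at the minimum of d[theta]/d(1/T).

    temps_c -- increasing temperatures in deg C
    theta   -- ellipticity at 222 nm measured at each temperature
    """
    best_t, best_slope = None, None
    for k in range(len(temps_c) - 1):
        t1 = temps_c[k] + 273.15      # kelvin
        t2 = temps_c[k + 1] + 273.15
        slope = (theta[k + 1] - theta[k]) / (1.0 / t2 - 1.0 / t1)
        if best_slope is None or slope < best_slope:
            best_slope = slope
            best_t = 0.5 * (temps_c[k] + temps_c[k + 1])  # interval midpoint
    return best_t


def synthetic_theta(t_c, t_mid=57.0, width=4.0, folded=-33000.0, unfolded=-3000.0):
    """Hypothetical two-state melt: helical signal is lost around t_mid."""
    frac_unfolded = 1.0 / (1.0 + math.exp(-(t_c - t_mid) / width))
    return folded + (unfolded - folded) * frac_unfolded


TEMPS = list(range(1, 100, 2))  # two degree increments
THETA = [synthetic_theta(t) for t in TEMPS]
TM = melting_temperature(TEMPS, THETA)
```

Because the estimate is reported at interval midpoints, its resolution is limited by the temperature increment, which is consistent with reproducibility of about one degree.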

[0176] Size exclusion chromatography: Size exclusion chromatography was performed with a Synchropak GPC
100 column (25 cm by 4.6 mm) at pH 7.0 in 50 mM phosphate and 150 mM NaCl at 0 °C. GCN4-p1 and p-LI (Harbury,
et al., Science 262:1401 (1993)) were used as size standards. 10 µL injections of 1 mM peptide solution were
chromatographed at 0.20 mL/min and monitored at 275 nm. Peptide concentrations were approximately 60 µM as estimated
from peak heights. Samples were run in triplicate.

[0177] The designed a and d sequences were synthesized as above using the GCN4-p1 sequence for the bc and
efg positions. Standard solid phase techniques were used and following HPLC purification, the identities of the
peptides were confirmed by mass spectrometry. Circular dichroism spectroscopy (CD) was used to assay the secondary
structure and thermal stability of the designed peptides. The CD spectra of all the peptides at 1 °C and a concentration
of 40 µM exhibit minima at 208 and 222 nm and a maximum at 195 nm, which are diagnostic for α helices (data not
shown). The ellipticity values at 222 nm indicate that all of the peptides are >85% helical (approximately -28000 deg
cm2/dmol), with the exception of PDA-3C which is 75% helical at 40 µM but increases to 90% helical at 170 µM (Table
2).






Table 2. CD data and calculated structural properties of the PDA peptides.

Name    -[θ]222         Tm    E_MC        ΔA_np  ΔA_p  Vol   Rot    E_CQ        E_CG        E_vdW       Npb  Pb
        (deg cm2/dmol)  (°C)  (kcal/mol)  (Å2)   (Å2)  (Å3)  bonds  (kcal/mol)  (kcal/mol)  (kcal/mol)
PDA-3H  33000           57    -118.1      2967   2341  1830  28     -234        -308        409         207  128
PDA-3A  30300           48    -115.3      2910   2361  1725  26     -232        -312        400         203  128
PDA-3B  28200           47    -112.6      2977   2372  1830  28     -242        -306        379         210  127
PDA-3G  30700           47    -112.8      3003   2383  1878  32     -240        -309        439         212  128
PDA-3F  28800           39    -103.9      3000   2336  1872  28     -188        -302        420         212  128
PDA-3D  27800           39    -109.7      2920   2392  1725  26     -240        -310        370         206  127
PDA-3C  24100           26    -109.6      2878   2400  1843  26     -149        -304        398         215  129
PDA-3E  27500           24    -103.1      2882   2361  1674  24     -179        -309        411         203  127

* E_MC is the Monte Carlo energy; ΔA_np and ΔA_p are the changes in solvent accessible non-polar and polar
surface areas upon folding, respectively; E_CQ and E_CG are electrostatic energies; E_vdW is the van der Waals
energy; Vol is the side chain van der Waals volume; Rot bonds is the number of buried rotatable bonds; Npb and
Pb are the number of buried nonpolar and polar atoms, respectively.



ferences from (XW.p^Tr^S^T^^^.^J^^ ' ""^^ "°' *° """"^er of sequent d«L 

«^th the appearance of ^^iS^^^^t^^l^^ T""'' ^ ™^ 

jo.s,vtofavor.,ne.zLoveror^^^ 

rreSr;ir.:s^^^^^^ 

the focus of further s on thes^e^Z TrS? '""^ "'."'""'"^ °' '^'^'V Packing are 

chernica.exchengeer,dre^;;l^hS^^^^^ 

istry 29:2891 (1990)); Goodmar, & Kim. Bo^JTm^t^S^^ (unpublished results) (Oas, e,a/.. Biocherr,: 

..^est correla..^. ^^^Z^^Z'r^T:^.^^^^^ '^-^ 
Ships (QSAR). whteh^lsrstr^^^n^^^J^e^^^^^^^ 

Chem. 28:1133 (1985)). "mmoniy used in structure based drug design (Hopfinger. J. Med. 

[0182] Table 2 lists various molecular properties of the PDA peptides in addition to the Monte Carlo scores and the
experimentally determined Tm's. A wide range of properties was examined, including molecular






mechanics components, such as electrostatic energies, and geometric measures, such as volume. The goal of QSAR
is the generation of equations that closely approximate the experimental quantity, in this case Tm, as a function of the
calculated properties. Such equations suggest which properties can be used in an improved cost function. The PDA
analysis module employs genetic function approximation (GFA) (Rogers & Hopfinger, J. Chem. Inf. Comput. Sci. 34:
854 (1994)), a novel method to optimize QSAR equations that selects which properties are to be included and the
relative weightings of the properties using a genetic algorithm. GFA accomplishes an efficient search of the space of
possible equations and robustly generates a list of equations ranked by their correlation to the data.
[0183] Equations are scored by lack of fit (LOF), a weighted least square error measure that resists overfitting by
penalizing equations with more terms (Rogers & Hopfinger, supra). GFA optimizes both the length and the composition
of the equations and, by generating a set of QSAR equations, clarifies combinations of properties that fit well and
properties that recur in many equations. All of the top five equations that correct the simulation energy (E_MC) contain
burial of nonpolar surface area, ΔA_np (Table 3).

Table 3. Top five QSAR equations generated by GFA with LOF, correlation coefficient and cross
validation scores.

QSAR equation                             LOF     r^2   CV r^2
-1.44*E_MC + 0.14*ΔA_np - 0.73*Npb        16.23   .98   .78
-1.78*E_MC + 0.20*ΔA_np - 2.43*Rot        23.13   .97   .75
-1.59*E_MC + 0.17*ΔA_np - 0.05*Vol        24.57   .97   .36
-1.54*E_MC + 0.11*ΔA_np                   25.45   .91   .80
-1.60*E_MC + 0.09*ΔA_np - 0.12*ΔA_p       33.88   .96   .90

ΔA_np and ΔA_p are nonpolar and polar surface buried upon folding, respectively. Vol is side chain
volume, Npb is the number of buried nonpolar atoms and Rot is the number of buried rotatable bonds.



[0184] The presence of ΔA_np in all of the top equations, in addition to the low LOF of the QSAR containing only E_MC
and ΔA_np, strongly implicates nonpolar surface burial as a critical property for predicting peptide stability. This conclusion
is not surprising given the role of the hydrophobic effect in protein energetics (Dill, Biochem. 29:7133 (1990)).
[0185] Properties were calculated using BIOGRAF and the Dreiding forcefield. Solvent accessible surface areas
were calculated with the Connolly algorithm (Connolly, (1983) (supra)) using a probe radius of 1.4 Å and a dot density
of 10 Å^-2. Volumes were calculated as the sum of the van der Waals volumes of the side chains that were optimized.
The number of buried polar and nonpolar heavy atoms were defined as atoms, with their attached hydrogens, that
expose less than 5 Å^2 in the surface area calculation. Electrostatic energies were calculated using a dielectric of one
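As a rough, hedged check on the two-term relationship, the Tm, E_MC and ΔA_np columns of Table 2 can be fitted by ordinary least squares. This is not genetic function approximation and does not compute LOF; the intercept, the leave-one-out score defined as 1 - PRESS/SS_tot, and all function names are assumptions of this sketch.

```python
# (Tm in deg C, E_MC in kcal/mol, dA_np in square angstroms), rows of Table 2
DATA = [
    (57, -118.1, 2967), (48, -115.3, 2910), (47, -112.6, 2977), (47, -112.8, 3003),
    (39, -103.9, 3000), (39, -109.7, 2920), (26, -109.6, 2878), (24, -103.1, 2882),
]


def mean(values):
    return sum(values) / len(values)


def fit(rows):
    """Least-squares fit Tm = b0 + b1*E_MC + b2*dA_np; returns (b0, b1, b2)."""
    y, x1, x2 = zip(*rows)
    my, m1, m2 = mean(y), mean(x1), mean(x2)
    u = [a - m1 for a in x1]   # centred E_MC
    v = [a - m2 for a in x2]   # centred dA_np
    w = [a - my for a in y]    # centred Tm
    suu = sum(a * a for a in u)
    svv = sum(a * a for a in v)
    suv = sum(a * b for a, b in zip(u, v))
    suw = sum(a * b for a, b in zip(u, w))
    svw = sum(a * b for a, b in zip(v, w))
    det = suu * svv - suv * suv
    b1 = (suw * svv - svw * suv) / det
    b2 = (svw * suu - suw * suv) / det
    return my - b1 * m1 - b2 * m2, b1, b2


def predict(coef, e_mc, da_np):
    return coef[0] + coef[1] * e_mc + coef[2] * da_np


def r_squared(rows, predictions):
    """1 - (residual sum of squares) / (total sum of squares)."""
    y = [row[0] for row in rows]
    ss_tot = sum((t - mean(y)) ** 2 for t in y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, predictions))
    return 1.0 - ss_res / ss_tot


COEF = fit(DATA)
R2 = r_squared(DATA, [predict(COEF, e, a) for _, e, a in DATA])
# Leave-one-out: refit without row k, then predict the withheld row.
LOO = [predict(fit(DATA[:k] + DATA[k + 1:]), DATA[k][1], DATA[k][2])
       for k in range(len(DATA))]
CV_R2 = r_squared(DATA, LOO)
```

On these eight rows the fitted slopes come out near -1.5 for E_MC and near 0.11 for ΔA_np with r^2 near 0.91, in line with the two-term equation of Table 3. The constant term is a by-product of this sketch, since Table 3 lists no constant terms.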






and no cutoff was set for calculation of non-bonded enemiBc rh=.„«™ -ii. ^ 

Phys. Chem. 95:3358 (1991) and Gast»ia2rTr»!» Ti! equilibration charges (Rappe & Goddard III. J. 

generate electrostatic ene^gLrCh^rSS^^ ^''''^ ^«^=»219 (1980) charges were used to 

and neuua. s.de chains in .^er .o prevlnTsJ^^^^^^^^ '^ckbones 
requirement that properties could not be highly corS Co^eS; f?'^*' ^ """"^ ^ 

techniques and only create redundancy in the de^^rS.^^ ""^^ *^ differBntialed by QSAR 

101861 Genetic function approximation (GFA) was performed in the CERiii<49=i»-ioB 

sym/Molecular Simulations. San Dieqo CAl Aninitiai^vv!.^^ - P**ag» version 1.6 (Bio- 

combinations of three propertiesS ii^ar^rr^Ti^^^ 

regression for each set of^roperti^s (^S^darteZ^nn 7 "Efficients we,« determined by least squares 
overmutations wereperfoLrJa chJd^aSa^^^^^^^ 

the wors. equation. L, mut^^'on oplt^^ m^T^ ^^^^^^^^ .hech..drep^ed 
gerteration. but these mutations were only accented if J. Zr^l, . pmbabilily of being applied each 

terrns was allowed. Equations were sc^ll'^jLv^^^^^^ 

error (LSE) measure that penaPzes equat^ns i more .ennstS^lTe'^rSlSe^^^t^^rci:^^^^ 



(1 . 2C,2- 



considered which resulted in m"X^Z ^"^"^ 

question were rem. This t^w eqTaZLTSn us^ to p^Z t TJt "'^""'""^'^ 

had been predicted in this maruTer ttSr^reS^S^ ^T*? the w t^held data point. When all of the data points 

QSAR and the E|^/A^. JM. OSAR^^^ ""^^^"'^ ^"-'s was computed (Table 3). Only the 

. «.u J . MC«»^p'i»ApUi>ARperfomied well in cross validation ThBF. ma «.™,,fi„„ u "^MC'^^^np 

to fit the data as smoothly as QSAR's wHh three teimo anrt h^ZT^ :. , ^^'^'> equation could not be expected 

two tem, QSAR's had LOF scoresqreate^ttenS^" r! « ^ "^"^ a" other 

The QSAR analysis indepeZTyS^ vStI n^^^^^^^ r^"'.'""" '^'^ " ^'''^ 

area burial is n^ssary ^mZi^te SSli^if,^« ' ^ consideration of nonpolar and polar surface 

These results justify the cost c^SS aS ''^ 0^^^^ ^"^'V^*- 

shown to perfom, well (van Gunsteren. i'^.T^!1^''^SS9^S^^^^^ ^'^"'^ ""•^•^'^ 

Sor^^csizjrri^^ 

dependenuycoun^ng buried surfarfrSrt^r^^^^ 

Of bunal because the radii of solvent accessible ^i.rfa^oc u . ^^E, leads to overestimation 

hence can overlap greatly in a .^X^^prote n 2^^^^^ f -•act radP and 

were reticulated using the pairwise area melhod an^nl^E^M Tz^^oSS'LT'^ "'"^ ''^ ^^^-^ 

of the coefficient to the 4A„„ and AA coefficients arp ciSl^ generated. The raMos 

convert buried surface a,Ba into^nergy ie-a^Tc soI^h^ ^™ f ^? ^ "'°'^'^'e »<> 

costfunct.on(Figure6B).lnaddition.theimplX1,SS^S^^ 

as the ground state. The surface area to en^i^l^e fS ffi'^^Sa^ naturally occumng GCN4-p1 sequence 
n.ol/A2 opposing polar area burial, are similaTn ^^:^r^^ ^TT^ ^'^^ "^"al and 86 cal/ 

dem.ed from small molecule transfer data J^^S Snb^g "'^"^^ P^^"-*- 

pn^^i ^^'^'^ (Lr^S^uer J Mo B^. 'mls'^SjJ been extensively chara^ 

PDB file 1LMB (Beamer & Pabo. J. IVlol Biol 227-177 f M2» m! i^f ^ •J'T'P'^'* «»rtinates were taken from 
removed from the context of the rest of ttie rtr.;rt„;17J ^ ^""^ designated chain 4 in the PDB file was 
hydrogens were added The hy^SoLc resS^rL^^'"^^''^ ^ '^'"S BIOGRAF explicit 

V47) are Y22. L31 . A37. mL '^"oTs l^^^^^^ f^rM^TT ' ™«««°" ^'^^ 0/36 'm40 

except for M42 which Is 65% buried arKl wh c Jis ?5% tZ A^l^ ^^'^ ""-"^ 
no.op«mi.ed.Theo.erninere.duesin.e5Asi:rtUra:^;°^^^^^^^ 






acid ("floated"). The mutation sites were allowed any rotamer of the amino acid sequence in question. Depending on
the mutant sequence, 5 x 10'^ to 7 x 10^^ conformations were possible. Rotamer energy and DEE calculation times
were 2 to 4 minutes. The combined activity score is that of Hellinga and Richards (Hellinga, et al., (1994) (supra)).
Seventy-eight of the 125 possible combinations were generated. Also, this dataset has been used to test several
computational schemes and can serve as a basis for comparing different forcefields (Lee & Levitt, Nature 352:448
(1991); van Gunsteren & Mark, supra; Hellinga, et al., (1994) (supra)). The simulation module, using the cost function
found by QSAR, was used to find the optimal conformation and energy for each mutant sequence. All hydrophobic
residues within 5 Å of the three mutation sites were also left free to be relaxed by the algorithm. This 5 Å sphere
contained 12 residues, a significantly larger problem than previous efforts (Lee & Levitt, supra; Hellinga, (1994) (supra)),
that were rapidly optimized by the DEE component of the simulation module. The rank correlation of the predicted
energy to the combined activity score proposed by Hellinga and Richards is shown in Figure 7. The wildtype has the
lowest energy of the 125 possible sequences and the correlation is essentially equivalent to previously published results,
which demonstrates that the QSAR corrected cost function is not specific for coiled coils and can model other proteins
adequately.

Example 2 

Automated design of the surface positions of protein helices 

[0191] GCN4-p1, a homodimeric coiled coil, was again selected as the model system because it can be readily
synthesized by solid phase techniques and its helical secondary structure and dimeric tertiary organization ease
characterization. The sequences of homodimeric coiled coils display a seven residue periodic hydrophobic and polar pattern
called a heptad repeat, (a-b-c-d-e-f-g) (Cohen & Parry, supra). The a and d positions are buried at the dimer interface
and are usually hydrophobic, whereas the b, c, e, f, and g positions are solvent exposed and usually polar (Figure 5).
Examination of the crystal structure of GCN4-p1 (O'Shea, et al., supra) shows that the b, c, and f side chains extend
into solvent and expose at least 55% of their surface area. In contrast, the e and g residues bury from 50 to 90% of
their surface area by packing against the a and d residues of the opposing helix. We selected the 12 b, c, and f residue
positions for surface sequence design: positions 3, 4, 7, 10, 11, 14, 17, 18, 21, 24, 25, and 28 using the numbering
from PDB entry 2zta (Bernstein, et al., J. Mol. Biol. 112:535 (1977)). The remainder of the protein structure, including
all other side chains and the backbone, was used as the template for sequence selection calculations. The symmetry
of the dimer and lack of interactions of surface residues between the subunits allowed independent design of each
subunit, thereby significantly reducing the size of the sequence optimization problem.

[0192] All possible sequences of hydrophilic amino acids (D, E, N, Q, K, R, S, T, A, and H) for the 12 surface positions
were screened by our design algorithm. The torsional flexibility of the amino acid side chains was accounted for by
considering a discrete set of all allowed conformers of each side chain, called rotamers (Ponder, et al., (1987) (supra);
Dunbrack, et al., Struc. Biol. Vol. 1(5):334-340 (1994)). Optimizing the 12 b, c, and f positions, each with 10 possible
amino acids, results in 10^12 possible sequences, which corresponds to 10^28 rotamer sequences when using the
Dunbrack and Karplus backbone-dependent rotamer library. The immense search problem presented by rotamer sequence
optimization is overcome by application of the Dead-End Elimination (DEE) theorem (Desmet, et al., (1992) (supra);
Desmet, et al., (1994) (supra); Goldstein, (1994) (supra)). Our implementation of the DEE theorem extends its utility
to sequence design and rapidly finds the globally optimal sequence in its optimal conformation.
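
As an illustration of how dead-end elimination prunes a rotamer search, the following is a minimal Python sketch of the Goldstein elimination criterion on a toy problem. It is not the implementation described in this specification: the random energy tables, the function names (`make_toy_problem`, `goldstein_dee`, `brute_force_minimum`) and the problem size are hypothetical and chosen only so that the result can be checked by exhaustive enumeration.

```python
import itertools
import random


def make_toy_problem(n_pos, n_rot, seed=0):
    """Random self and pair energies for a toy problem (hypothetical data)."""
    rng = random.Random(seed)
    self_e = [[rng.uniform(-1.0, 1.0) for _ in range(n_rot)] for _ in range(n_pos)]
    pair_e = {
        (i, j): [[rng.uniform(-1.0, 1.0) for _ in range(n_rot)] for _ in range(n_rot)]
        for i in range(n_pos)
        for j in range(i + 1, n_pos)
    }
    return self_e, pair_e


def pair_energy(pair_e, i, r, j, s):
    """Pair energy between rotamer r at position i and rotamer s at position j."""
    return pair_e[(i, j)][r][s] if i < j else pair_e[(j, i)][s][r]


def total_energy(assignment, self_e, pair_e):
    """Sum of self energies and all pairwise energies for one assignment."""
    n = len(assignment)
    energy = sum(self_e[i][assignment[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            energy += pair_e[(i, j)][assignment[i]][assignment[j]]
    return energy


def goldstein_dee(self_e, pair_e):
    """Return the set of surviving rotamers at each position.

    Rotamer r at position i is eliminated when some alternative t gives
    E(i_r) - E(i_t) + sum_j min_s [E(i_r, j_s) - E(i_t, j_s)] > 0,
    because r can then always be replaced by t at lower energy.
    """
    n = len(self_e)
    alive = [set(range(len(rots))) for rots in self_e]
    changed = True
    while changed:  # repeat until a full pass eliminates nothing
        changed = False
        for i in range(n):
            for r in sorted(alive[i]):
                for t in sorted(alive[i]):
                    if t == r:
                        continue
                    gap = self_e[i][r] - self_e[i][t]
                    for j in range(n):
                        if j != i:
                            gap += min(
                                pair_energy(pair_e, i, r, j, s)
                                - pair_energy(pair_e, i, t, j, s)
                                for s in alive[j]
                            )
                    if gap > 0.0:
                        alive[i].discard(r)
                        changed = True
                        break
    return alive


def brute_force_minimum(self_e, pair_e, choices):
    """Exhaustively find the lowest-energy assignment over the given choices."""
    best = min(itertools.product(*choices), key=lambda a: total_energy(a, self_e, pair_e))
    return total_energy(best, self_e, pair_e), best


self_e, pair_e = make_toy_problem(n_pos=4, n_rot=5, seed=1)
alive = goldstein_dee(self_e, pair_e)
full = brute_force_minimum(self_e, pair_e, [range(5)] * 4)
pruned = brute_force_minimum(self_e, pair_e, [sorted(a) for a in alive])
print("surviving rotamers per position:", [sorted(a) for a in alive])
print("same global minimum after pruning:", full == pruned)
```

The exhaustive comparison is feasible only because the toy problem is tiny. For a search space on the order of 10^28 rotamer sequences, elimination has to do essentially all of the work.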
[0193] We examined three potential-energy functions for their effectiveness in scoring surface sequences. Each
candidate scoring function was used to design the b, c, and f positions of the model coiled coil and the resulting peptide
was synthesized and characterized to assess design performance. A hydrogen-bond potential was used to check if
predicted hydrogen bonds can contribute to designed protein stability, as expected from studies of hydrogen bonding
in proteins and peptides (Stickle, et al., supra; Huyghues-Despointes, et al., supra). Optimizing sequences for hydrogen
bonding, however, often buries polar protons that are not involved in hydrogen bonds. This uncompensated loss of
potential hydrogen-bond donors to water prompted examination of a second scoring scheme consisting of a
hydrogen-bond potential in conjunction with a penalty for burial of polar protons (Eisenberg, (1986) (supra)). We tested
a third scoring scheme which augments the hydrogen-bond potential with the empirically derived helix propensities of
Baldwin and coworkers (Chakrabartty, et al., supra). Although the physical basis of helix propensities is unclear, they
can have a significant effect on protein stability and can potentially be used to improve protein designs (O'Neil &
DeGrado, 1990; Zhang, et al., Biochem. 30:2012 (1991); Blaber, et al., Science 260:1637 (1993); O'Shea, et al., 1993;
Villegas, et al., Folding and Design 1:29 (1996)). A van der Waals potential was used in all cases to account for packing
interactions and excluded volume.

[0194] Several other sequences for the b, c, and f positions were also synthesized and characterized to help discern
the relative importance of the hydrogen-bonding and helix-propensity potentials. The sequence designed with the
hydrogen-bond potential was randomly scrambled, thereby disrupting the designed interactions but not changing the






[...] symmetry of the dimer and lack of interactions of surface residues between the subunits [...] computations were
done using the first monomer [...] backbone-dependent rotamer library was used (Dunbrack, et al., (1993) (supra)).
χ angles that were undetermined [...] were assigned the following values: Arg, [...] Lys, -60°, 60°, and 180°. χ angles
that were undetermined from [...] that resulted in sequential g+/g- or g-/g+ angles were eliminated. [...] Lennard-Jones
12-6 potential with van der Waals radii scaled by 0.9 (Dahiyat, et al., [...]) [...] consisted of a distance-dependent
term and an angle-dependent term [...] above. This hydrogen bond potential is based on the potential used in
DREIDING [...] terms to limit the occurrence of unfavorable hydrogen bond geometries. The angle term [...] on the
hybridization state of the donor and acceptor, as shown in Equations 10 to 13 above. [...] for the sp3 donor [...] no
switching functions were used. Template donors and acceptors that were [...] template hydrogen bonds were not
included in the donor and acceptor lists. [...] considered to exist when [...] R ≤ 3.3 Å and θ [...] was only applied to
buried polar hydrogens not involved in hydrogen bonds [...] a hydrogen bond was considered to exist when [...] was
less than -2 kcal/mol. [...] potential was also supplemented with [...] distance-dependent dielectric constant of [...]
where R is the interatomic distance. [...] formal charge of +1 was used for Arg and Lys and [...] Energies associated
with α-helical propensities were calculated [...] α-helical propensity, ΔG°[...] is the standard free energy [...] and
ΔG°[...] is the standard free energy of helix propagation of alanine used as a standard [...] was set to 3.0. This
potential was selected in order to scale the [...] the scoring function. The DEE optimization followed the methods
of our [...]

[0197] Calculations were performed on either a [...] processor, R10000-based Silicon Graphics Power Challenge or
a 512 node Intel Delta.

[0198] Peptide synthesis and purification and CD analysis [...] prepared in 90/10 H2O/D2O and 50 mM sodium
phosphate buffer [...] on a Varian Unityplus 600 MHz spectrometer at 25 °C. 32 transients were acquired with [...]
presaturation used for water suppression. Samples were 1 mM. Size exclusion chromatography [...] PolyLC
hydroxyethyl A column ([...] x 9 mm) at pH 7.0 in 50 mM phosphate and [...] were used as size standards for dimer
and tetramer, respectively. [...] Sequences of all of the peptides examined in this study are shown in Table 4.






Table 4. Sequences and properties of the synthesized peptides

Peptide    Design method    Surface sequence      Tm (°C)    ΣΔG° (kcal/mol)    N
                            bcf bcf bcf bcf
GCN4-p1    none             KQD EES YHN ARK       57          3.831             2
6A         HB               EKD RER RRE RRE       71          2.193             2
6B         HB + PB          EKQ KER ERE ERQ       72          2.866             2
6C         HB + HP          ARA AAA RRR ARA       69         -2.041             2
6D         scrambled HB     REE RRR EDR KRE       71          2.193             2
6E         random polar     [...] AKS ANH NTQ     15          4.954             2
6F         poly(Ala)        AAA AAA AAA AAA       73         -3.096             4



For clarity only the designed surface residues are shown and they are grouped by position (b, c,
and f). The sequence numbers of the designed positions are: 3, 4, 7, 10, 11, 14, 17, 18, 21, 24,
25, and 28. Melting temperatures (Tm's) were determined by circular dichroism and
oligomerization states (N) were determined by size exclusion chromatography. ΣΔG° is the sum
of the standard free energy of helix propagation of the 12 b, c, and f positions (Chakrabartty, et
al., 1994). Abbreviations for design methods are: hydrogen bonds (HB), polar hydrogen burial
penalty (PB), and helix propensity (HP).



[0200] Sequence 6A, designed with a hydrogen-bond potential, has a preponderance of Arg and Glu residues that
are predicted to form numerous hydrogen bonds to each other. These long chain amino acids are favored because
they can extend across turns of the helix to interact with each other and with the backbone. When the optimal geometry
of the scrambled 6A sequence, 6D, was found with DEE, far fewer hydrogen bonding interactions were present and
its score was much worse than 6A's. 6B, designed with a polar hydrogen burial penalty in addition to a hydrogen-bond
potential, is still dominated by long residues such as Lys, Glu and Gln but has fewer Arg. Because Arg has more polar
hydrogens than the other amino acids, it more often buries nonhydrogen-bonded protons and therefore is disfavored
when using this potential function. 6C was designed with a hydrogen-bond potential and helix propensity in the scoring
function and consists entirely of Ala and Arg residues, the amino acids with the highest helix propensities (Chakrabartty,
et al., supra). The Arg residues form hydrogen bonds with Glu residues at nearby e and g positions. The random
hydrophilic sequence, 6E, possesses no hydrogen bonds and scores very poorly with all of the potential functions used.
[0201] The secondary structures and thermal stabilities of the peptides were assessed by circular dichroism (CD)
spectroscopy. The CD spectra of the peptides at 1 °C and 40 µM are characteristic of α helices, with minima at 208
and 222 nm, except for the random surface sequence peptide 6E. 6E has a spectrum suggestive of a mixture of α
helix and random coil with a [θ]222 of -12000 deg cm^2/dmol, while all the other peptides are greater than 90% helical
with [θ]222 of less than -30000 deg cm^2/dmol. The melting temperatures (Tm's) of the designed peptides are 12-16 °C
higher than the Tm of GCN4-p1, with the exception of 6E which has a Tm of 15 °C. CD spectra taken before and after
melts were identical, indicating reversible thermal denaturation. The redesign of surface positions of this coiled coil
produces structures that are much more stable than wildtype GCN4-p1, while a random hydrophilic sequence largely
disrupts the peptide's stability.
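
The stated 12-16 °C gain in melting temperature can be checked directly against the Tm column of Table 4. The short Python sketch below does only that arithmetic; the dictionary `TM` and the helper `tm_shift` are illustrative names, not part of the described method.

```python
# Melting temperatures (deg C) as listed in Table 4.
TM = {"GCN4-p1": 57, "6A": 71, "6B": 72, "6C": 69, "6D": 71, "6E": 15, "6F": 73}


def tm_shift(peptide, reference="GCN4-p1"):
    """Difference in melting temperature relative to the reference peptide."""
    return TM[peptide] - TM[reference]


shifts = {name: tm_shift(name) for name in TM if name != "GCN4-p1"}
# Every peptide except the random polar sequence 6E.
stabilized = {name: d for name, d in shifts.items() if name != "6E"}

for name, delta in sorted(shifts.items()):
    print(f"{name}: {delta:+d} deg C relative to GCN4-p1")
print("range without 6E:", min(stabilized.values()), "to", max(stabilized.values()))
```

With the Table 4 values, the peptides other than 6E span +12 to +16 °C, while 6E melts 42 °C below GCN4-p1.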

[0202] Size exclusion chromatography (SEC) showed that all the peptides were dimers except for 6F, the all Ala
surface sequence, which migrated as a tetramer. These data show that surface redesign did not change the tertiary
structure of these peptides, in contrast to some core redesigns (Harbury, et al., supra). In addition, nuclear magnetic
resonance (NMR) spectra of the peptides at 1 mM showed chemical shift dispersion similar to GCN4-p1 (data not
shown).

[0203] Peptide 6A, designed with a hydrogen-bond potential, melts at 71 °C versus 57 °C for GCN4-p1, demonstrating
that rational design of surface residues can produce structures that are markedly more stable than naturally
occurring coiled coils. This gain in stability is probably not due to improved hydrogen bonding since 6D, which has the
same surface amino acid composition as 6A but a scrambled sequence and no predicted hydrogen bonds, also melts
at 71 °C. Further, 6B was designed with a different scoring function and has a different sequence and set of predicted
hydrogen bonds but a very similar Tm of 72 °C.






[...] relative to GCN4-p1 is their higher helix [...] among the best helix formers (Chakrabartty, et al., supra) [...]
sequence position as that of hydrogen bond [...] the sequence of 6A. [...] sequences, the sum of the standard free
energies of helix propagation (ΣΔG°) (Chakrabartty, [...]) [...] than 6A and 6B, in spite of 6C's higher helix propensity.
Similarly, 6F has the highest helix propensity [...] ΣΔG° of -3.096 kcal/mol, but its Tm of 73 °C is only [...] during
SEC, not a dimer [...] Though the results for 6C and 6F [...] important for surface design, they point out possible
limitations in using propensity [...] necessarily confer the greatest stability on a structure, perhaps because [...] have
a dramatic impact on [...] designed sequences (Tm 69-72 °C). This result is consistent with studies on other proteins
[...] residues (O'Neil & DeGrado, 1990; Zhang, et al., [...] (1995)). Further, these designs have [...] sequence,
demonstrating [...] surface residues can be used to improve stability [...] appears to be more important than hydrogen
bonding in stabilizing [...] hydrogen bonding could be important in the design and stabilization of other [...]

Example 3 

[...] van der Waals, [...] bonding, secondary [...]

[...] in selecting a motif to test the [...] (1991) (supra)). Though it consists of less than [...] Further, recent work by
Imperiali and coworkers [...] amino acid (D-proline) and a non-natural amino acid [...] ([...], 1996a). The Brookhaven
Protein Data Bank (PDB) (Bernstein, et al., 1977) was examined [...] module of [...] DNA binding protein [...]
(Pavletich, et al., (supra)). The backbone of the second module [...] zinc fingers in other proteins and is therefore [...]
Zif268 and [...] from the crystal structure starting at lysine 33 in the numbering of PDB entry [...] The first [...] tight
turn at the [...] extent of side chain [...] motif limits to one (position 5) the number of residues that can be [...]
(positions 3, 12, 18, 21, 22, and 25) were classified as [...] (positions 3, 5, and 12) and four are from the helix
(positions 18, 21, 22, and 25) [...] the core and two are in the boundary, but the [...] geometric center and is therefore
classified as a surface position. [...] considered by the design algorithm are 4, 9, and 11 from the sheet, 15, 16, [...]
which cap the helix ends. The remaining exposed positions [...] were not included in the sequence [...]




acids considered at the core positions during sequence selection were A, V, L, I, F, Y, and W; the amino acids considered
at the surface positions were A, S, T, H, D, N, E, Q, K, and R; and the combined core and surface amino acid sets (16
amino acids) were considered at the boundary positions.

[0209] In total, 20 out of 28 positions of the template were optimized during sequence selection. The algorithm first
selects Gly for all positions with φ angles greater than 0° in order to minimize backbone strain (residues 9 and 27). The
18 remaining residues were split into two sets and optimized separately to speed the calculation. One set contained
the 1 core position, the 6 boundary positions and position 8, which resulted in 1.2 x 10^9 possible amino acid sequences
corresponding to 4.3 x 10^19 rotamer sequences. The other set contained the remaining 10 surface residues, which had
10^10 possible amino acid sequences and 4.1 x 10^23 rotamer sequences. The two groups do not interact strongly with
each other, making their sequence optimizations mutually independent, though there are strong interactions within each
group. Each optimization was carried out with the non-optimized positions in the template set to the crystallographic
coordinates.
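
The amino acid sequence counts for the two optimization sets follow from the residue sets listed above. The Python sketch below reproduces that arithmetic, assuming (as the counts imply) that position 8 draws from the 10 surface amino acids; the names `CORE`, `SURFACE`, `BOUNDARY` and `sequence_count` are illustrative only.

```python
CORE = set("AVLIFYW")        # 7 amino acids allowed at core positions
SURFACE = set("ASTHDNEQKR")  # 10 amino acids allowed at surface positions
BOUNDARY = CORE | SURFACE    # combined set; Ala is shared, giving 16


def sequence_count(n_core, n_boundary, n_surface):
    """Number of amino acid sequences for the given numbers of positions."""
    return (
        len(CORE) ** n_core
        * len(BOUNDARY) ** n_boundary
        * len(SURFACE) ** n_surface
    )


# Set 1: the core position, six boundary positions and position 8.
first_set = sequence_count(n_core=1, n_boundary=6, n_surface=1)
# Set 2: the remaining ten surface positions.
second_set = sequence_count(n_core=0, n_boundary=0, n_surface=10)

print(f"set 1: {first_set:.1e} sequences")
print(f"set 2: {second_set:.1e} sequences")
```

Because the two sets are optimized independently, the work scales with the two counts separately rather than with their product.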

[0210] The optimal sequences found from the two calculations were combined and are shown in Figure 8 aligned
with the sequence from the second zinc finger of Zif268. Even though all of the hydrophilic amino acids were considered
at each of the boundary positions, only nonpolar amino acids were selected. The calculated seven core and boundary
positions form a well-packed buried cluster. The Phe side chains selected by the algorithm at the zinc binding His
positions, 21 and 25, are 80% buried and the Ala at 5 is 100% buried, while the Lys at 8 is greater than 60% exposed
to solvent. The other boundary positions demonstrate the strong steric constraints on buried residues by packing similar
side chains in an arrangement similar to Zif268. The calculated optimal configuration buried 830 Å^2 of nonpolar
surface area, with Phe 12 (96% buried) and Leu 18 (88% buried) anchoring the cluster. On the helix surface, the
algorithm positions Asn 14 as a helix N-cap with a hydrogen bond between its side-chain carbonyl oxygen and the
backbone amide proton of residue 16. The six charged residues on the helix form three pairs of hydrogen bonds,
though in our coiled coil designs helical surface hydrogen bonds appeared to be less important than the overall helix
propensity of the sequence. Positions 4 and 11 on the exposed sheet surface were selected to be Thr, one of the best
β-sheet forming residues (Kim & Berg, 1993; Minor, et al., (1994) (supra); Smith, et al., (1995) (supra)).
[0211] Combining the 20 designed positions with the Zif268 amino acids at the remaining 8 sites results in a peptide
with overall 39% (11/28) homology to Zif268, which reduces to 15% (3/20) homology when only the designed positions
are considered. A BLAST (Altschul, et al., 1990) search of the non-redundant protein sequence database of the National
Center for Biotechnology Information finds weak homology, less than 40%, to several zinc finger proteins and fragments
of other unrelated proteins. None of the alignments had significance values less than 0.26. By objectively selecting 20
out of 28 residues on the Zif268 template, a peptide with little homology to known proteins and no zinc binding site
was designed.

[0212] Experimental characterization: The far UV circular dichroism (CD) spectrum of the designed molecule,
pda8d, shows a maximum at 195 nm and minima at 218 nm and 208 nm, which is indicative of a folded structure. The
thermal melt is weakly cooperative, with an inflection point at 39 °C, and is completely reversible. The broad melt is
consistent with a low enthalpy of folding, which is expected for a motif with a small hydrophobic core. This behavior
contrasts with the uncooperative transitions observed for other short peptides (Weiss & Keutmann, 1990; Scholtz, et al.,
PNAS USA 88:2854 (1991); Struthers, et al., J. Am. Chem. Soc. 118:3073 (1996b)).

[0213] Sedimentation equilibrium studies at 100 µM and both 7 °C and 25 °C give a molecular mass of 3490, in good
agreement with the calculated mass of 3362, indicating the peptide is monomeric. At concentrations greater than 500
µM, however, the data do not fit well to an ideal single species model. When the data were fit to a
monomer-dimer-tetramer model, dissociation constants of 0.5 - 1.5 mM for monomer-to-dimer and greater than 4 mM for
dimer-to-tetramer were found, though the interaction was too weak to accurately measure these values. Diffusion
coefficient measurements using the water-sLED pulse sequence (Altieri, et al., 1995) agreed with the sedimentation
results: at 100 µM pda8d has a diffusion coefficient close to that of a monomeric zinc finger control, while at 1.5 mM the
diffusion coefficient is similar to that of protein G β1, a 56 residue protein. The CD spectrum of pda8d is concentration
independent from 10 µM to 2.6 mM. NMR COSY spectra taken at 2.1 mM and 100 µM were almost identical, with 5 of
the Hα-HN crosspeaks shifted no more than 0.1 ppm and the rest of the crosspeaks remaining unchanged. These data
indicate that pda8d undergoes a weak association at high concentration, but this association has essentially no effect
on the peptide's structure.

[0214] The NMR chemical shifts of pda8d are well dispersed, suggesting that the protein is folded and well-ordered.
The Hα-HN fingerprint region of the TOCSY spectrum is well-resolved with no overlapping resonances (Figure 9A)
and all of the Hα and HN resonances have been assigned. NMR data were collected on a Varian Unityplus 600 MHz
spectrometer equipped with a Nalorac inverse probe with a self-shielded z-gradient. NMR samples were prepared in
90/10 H2O/D2O or 99.9% D2O with 50 mM sodium phosphate at pH 5.0. Sample pH was adjusted using a glass
electrode with no correction for the effect of D2O on measured pH. All spectra for assignments were collected at 7 °C.
Sample concentration was approximately 2 mM. NMR assignments were based on standard homonuclear methods
using DQF-COSY, NOESY and TOCSY spectra (Wüthrich, NMR of Proteins and Nucleic Acids (John Wiley & Sons




[...] 512 increments in F1 and [...] were acquired with a spectral width of 7500 Hz and 32 transients. NOESY spectra
[...] 200 ms and TOCSY spectra were recorded with an isotropic mixing time of [...] Water suppression in the NOESY
spectra was accomplished with the WATERGATE [...] were referenced to the HOD resonance. Spectra were [...]

[0215] Water-sLED experiments (Altieri, et al., 1995) [...] D2O with 50 mM sodium phosphate at pH 5.0 [...] gradient
[...] G/cm and a diffusion time of 50 ms was used. Spectra were processed with [...] of the aromatic and high field
aliphatic protons were calculated and fit to an equation [...] gradient strength in order to extract diffusion coefficients
(Altieri, et al., 1995). [...] 1.62 x 10^-[...] and 1.73 x 10^-7 cm^2/s at 1.5 mM, 400 µM and 100 µM, respectively. [...]
was 1.72 x 10^-[...] cm^2/s and for protein G β1 was [...]

[0216] All unambiguous sequential and medium-range NOEs [...] HN-HN NOEs were found for all pairs of residues
except [...] P2-Y3 which have degenerate Hα chemical shifts [...] using X-PLOR (Brunger, 1992) with standard
protocols for hybrid [...] structures in the ensemble had good covalent geometry and no [...] Table 5. The backbone
was well defined with a root-mean-square [...] termini (residues 1, 2, 27, and 28) were excluded [...] (rms) plus the
buried side chains (residues 3, 5, 7, 12, 18, 21, 22, and 25) was 1.05 Å [...]



Table 5. NMR structure determination of pda8d: distance restraints, structural statistics, atomic root-mean-square (rms)
deviations, and comparison to the design target. <SA> are the 32 simulated annealing structures, SA is the average
structure and SD is the standard deviation. The design target is the backbone of Zif268.

Distance restraints
  Intraresidue                                          [...]
  Sequential                                            [...]
  Short range (|i-j| = 2-5 residues)                    [...]
  Long range (|i-j| > 5 residues)                       [...]
  Total                                                 [...]

Structural statistics
  Rms deviation from distance restraints (Å)            0.049 ± 0.004
  Rms deviation from idealized geometry
    Bonds (Å)                                           0.0051 ± 0.0004
    Angles (degrees)                                    0.76 ± 0.04
    Impropers (degrees)                                 0.56 ± 0.04

Atomic rms deviations (Å)*                              <SA> vs. SA ± SD
  Backbone                                              [...]
  Backbone + nonpolar side chains                       1.05 ± 0.06
  Heavy atoms                                           1.25 ± 0.04

Atomic rms deviations between pda8d and the design target (Å)*
                                                        SA vs. target
  Backbone                                              1.04
  Heavy atoms                                           2.15

*Atomic rms deviations are for residues 3 to 26, inclusive. The termini, residues 1, 2, 27, and 28, were highly
disordered and had very few sequential or non-intraresidue contacts.



[0218] The NMR solution structure of pda8d shows that it folds into a ββα motif with well-defined secondary structure
elements and tertiary organization which match the design target. A direct comparison of the design template, the
backbone of the second zinc finger of Zif268, to the pda8d solution structure highlights their similarity (data not shown).
Alignment of the pda8d backbone to the design target is excellent, with an atomic rms deviation of 1.04 Å (Table 5).
Pda8d and the design target correspond throughout their entire structures, including the turns connecting the secondary
structure elements.
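
For readers unfamiliar with the quantity, an atomic rms deviation between two structures is the square root of the mean squared distance between matched atoms. The Python sketch below shows only that formula. It assumes the two coordinate sets are already superimposed and listed in the same atom order, it omits the optimal superposition step normally performed first, and the function name `rms_deviation` and the coordinates are hypothetical.

```python
import math


def rms_deviation(coords_a, coords_b):
    """Root-mean-square distance between matched atoms of two coordinate sets.

    Both inputs are sequences of (x, y, z) tuples in the same atom order and
    are assumed to be superimposed already.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical three-atom example: the second set is the first shifted by 1 Å in x.
model = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
target = [(x + 1.0, y, z) for x, y, z in model]
shifted_rmsd = rms_deviation(model, target)
print(f"rms deviation: {shifted_rmsd:.2f} Å")
```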

[0219] In conclusion, the experimental characterization of pda8d shows that it is folded and well-ordered with a
weakly cooperative thermal transition, and that its structure is an excellent match to the design target. To our knowledge,
pda8d is the shortest sequence of naturally occurring amino acids that folds to a unique structure without metal binding,
oligomerization or disulfide bond formation (McKnight, et al., Nature Struc. Biol. 4:180 (1996)). The successful design
of pda8d supports the use of objective, quantitative sequence selection algorithms for protein design. This robustness
suggests that the program can be used to design sequences for de novo backbones.

Example 4 

Protein design using a scaled van der Waals scoring function in the core region 

[0220] An ideal model system to study core packing is the β1 immunoglobulin-binding domain of streptococcal protein
G (Gβ1) (Gronenborn, et al., Science 253:657 (1991); Alexander, et al., Biochem. 31:3597 (1992); Barchi, et al., Protein
Sci. 3:15 (1994); Gallagher, et al., 1994; Kuszewski, et al., 1994; Orban, et al., 1995). Its small size, 56 residues,
renders computations and experiments tractable. Perhaps most critical for a core packing study, Gβ1 contains no
disulfide bonds and does not require a cofactor or metal ion to fold. Further, Gβ1 contains sheet, helix and turn structures
and is without the repetitive side-chain packing patterns found in coiled coils or some helical bundles. This lack of
periodicity reduces the bias from a particular secondary or tertiary structure and necessitates the use of an objective
side-chain selection program to examine packing effects.

[0221] Sequence positions that constitute the core were chosen by examining the side-chain solvent accessible
surface area of Gβ1. Any side chain exposing less than 10% of its surface was considered buried. Eleven residues
meet this criterion, with seven from the β sheet (positions 3, 5, 7, 20, 43, 52 and 54), three from the helix (positions 26,
30, and 34) and one in an irregular secondary structure (position 39). These positions form a contiguous core. The
remainder of the protein structure, including all other side chains and the backbone, was used as the template for
sequence selection calculations at the eleven core positions.

[0222] All possible core sequences consisting of alanine, valine, leucine, isoleucine, phenylalanine, tyrosine or
tryptophan (A, V, L, I, F, Y or W) were considered. Our rotamer library was similar to that used by Desmet and coworkers
(Desmet, et al., (1992) (supra)). Optimizing the sequence of the core of Gβ1 with 217 possible hydrophobic rotamers
at all 11 positions results in 217^11, or 5 x 10^25, rotamer sequences. Our scoring function consisted of two components:
a van der Waals energy term and an atomic solvation term favoring burial of hydrophobic surface area. The van der
Waals radii of all atoms in the simulation were scaled by a factor α (Eqn. 3) to change the importance of packing effects.
Radii were not scaled for the buried surface area calculations. By predicting core sequences with various radii scalings
and then experimentally characterizing the resulting proteins, a rigorous study of the importance of packing effects on
protein design is possible.

[...] hydrogens on the structure which was then conjugate gradient minimized [...] (Mayo, et al., 1990, supra). The
rotamer library, DEE optimization and [...] as outlined above. A Lennard-Jones 12-6 potential was used for van der
Waals interactions [...] various cases as discussed herein. The Richards definition of solvent accessible [...] areas
were calculated with the Connolly algorithm (Connolly, [...]). [...] parameter, derived from our previous work, of
23 cal/mol/Å^2 was used to favor [...] To calculate side-chain nonpolar exposure in our [...] hydrophobic area exposed
by a rotamer in isolation. This exposure is decreased by [...] contacts, and [...] sum of the areas buried in pairwise
rotamer/rotamer [...] factor used in [...]



Table 6. 




[0225] In Table 6, the Gβ1 sequence and posit[...]

in the optimal sequence, demonstrating the algorithm's robustness to minor parameter perturbations. Further, the
packing arrangements predicted with α = 0.90 - 1.05 closely match Gβ1, with average χ angle differences of only 4° from
the crystal structure. The high identity and conformational similarity to Gβ1 imply that, when packing constraints are
used, backbone conformation strongly determines a single family of well packed core designs. Nevertheless, the
constraints on core packing were being modulated by α, as demonstrated by Monte Carlo searches for other low energy
sequences. Several alternate sequences and packing arrangements are in the twenty best sequences found by the
Monte Carlo procedure when α = 0.90. These alternate sequences score much worse when α = 0.95, and when α =
1.0 or 1.05 only strictly conservative packing geometries have low energies. Therefore, α = 1.05 and α = 0.90 define
the high and low ends, respectively, of a range where packing specificity dominates sequence design.
[0227] For α < 0.90, the role of packing is reduced enough to let the hydrophobic surface potential begin to dominate,
thereby increasing the size of the residues selected for the core (Table 6). A significant change in the optimal sequence
appears between α = 0.90 and 0.85, with both α85 and α80 containing three additional mutations relative to α90. Also,
α85 and α80 have a 15% increase in total side-chain volume relative to Gβ1. As α drops below 0.80 an additional 10%
increase in side-chain volume and numerous mutations occur, showing that packing constraints have been
overwhelmed by the drive to bury nonpolar surface. Though the jumps in volume and shifts in packing arrangement appear
to occur suddenly for the optimal sequences, examination of the suboptimal low energy sequences by Monte Carlo
sampling demonstrates that the changes are not abrupt. For example, the α85 optimal sequence is the 11th best
sequence when α = 0.90, and similarly, the α90 optimal sequence is the [...] best sequence when α = 0.85.
[0228] For α > 1.05 atomic van der Waals repulsions are so severe that most amino acids cannot find any allowed
packing arrangements, resulting in the selection of alanine for many positions. This stringency is likely an artifact of
the large atomic radii and does not reflect increased packing specificity accurately. Rather, α = 1.05 is the upper limit
for the usable range of van der Waals scales within our modeling framework.
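
The effect of the scale factor on repulsion can be illustrated with a generic 12-6 Lennard-Jones expression in which the minimum-energy separation is multiplied by α. The Python sketch below is an assumption-laden illustration: the functional form E = D0[(αR0/R)^12 - 2(αR0/R)^6] is a common way of writing the 12-6 potential and is not copied from Eqn. 3, and the values of `R0`, `D0` and the contact distance are hypothetical.

```python
def scaled_lj_energy(distance, r0, well_depth, alpha):
    """12-6 Lennard-Jones energy with the minimum-energy separation scaled by alpha.

    Assumed form: E = D0 * [(alpha*R0/R)**12 - 2*(alpha*R0/R)**6].
    """
    if distance <= 0.0:
        raise ValueError("distance must be positive")
    ratio = (alpha * r0) / distance
    return well_depth * (ratio ** 12 - 2.0 * ratio ** 6)


R0 = 3.9       # hypothetical unscaled minimum-energy separation, in Å
D0 = 0.1       # hypothetical well depth, in kcal/mol
CONTACT = 3.6  # hypothetical fixed interatomic distance, in Å

energies = {
    alpha: scaled_lj_energy(CONTACT, R0, D0, alpha) for alpha in (0.9, 1.0, 1.05)
}
for alpha, energy in energies.items():
    print(f"alpha = {alpha:.2f}: E = {energy:+.3f} kcal/mol")
```

For this fixed contact, the energy rises from attractive at α = 0.9 to repulsive at α = 1.05, which is the qualitative trend behind the loss of allowed packing arrangements at large α.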

[0229] Experimental characterization of core designs. Variation of the van der Waals scale factor α results in four
regimes of packing specificity: regime 1 where 0.9 ≤ α ≤ 1.05 and packing constraints dominate the sequence selection;
regime 2 where 0.8 ≤ α < 0.9 and the hydrophobic solvation potential begins to compete with packing forces; regime
3 where α < 0.8 and hydrophobic solvation dominates the design; and regime 4 where α > 1.05 and van der Waals
repulsions appear to be too severe to allow meaningful sequence selection. Sequences that are optimal designs were
selected from each of the regimes for synthesis and characterization. They are α90 from regime 1, α85 from regime
2, α70 from regime 3 and α107 from regime 4. For each of these sequences, the calculated amino acid identities of
the eleven core positions are shown in Table 6; the remainder of the protein sequence matches Gβ1. The goal was to
study the relation between the degree of packing specificity used in the core design and the extent of native-like
character in the resulting proteins.
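
The four regimes can be written as a simple lookup on α. The Python sketch below encodes the boundaries stated above and checks them against the four synthesized designs; the function name `packing_regime` is illustrative, and the α values are read from the design names (for example, α107 is taken to mean α = 1.07).

```python
def packing_regime(alpha):
    """Return the packing-specificity regime (1 to 4) for a van der Waals scale factor."""
    if alpha > 1.05:
        return 4  # repulsions too severe for meaningful sequence selection
    if alpha >= 0.9:
        return 1  # packing constraints dominate
    if alpha >= 0.8:
        return 2  # solvation begins to compete with packing
    return 3      # hydrophobic solvation dominates


designs = {"α90": 0.90, "α85": 0.85, "α70": 0.70, "α107": 1.07}
regimes = {name: packing_regime(alpha) for name, alpha in designs.items()}
for name, regime in regimes.items():
    print(f"{name}: regime {regime}")
```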

[0230] Peptide synthesis and purification. With the exception of the eleven core positions designed by the
sequence selection algorithm, the sequences synthesized match Protein Data Bank entry 1pga. Peptides were
synthesized using standard Fmoc chemistry, and were purified by reverse-phase HPLC. Matrix assisted laser desorption
mass spectrometry found all molecular weights to be within one unit of the expected masses.
[0231] CD and fluorescence spectroscopy and size exclusion chromatography. The solution conditions for all
experiments were 50 mM sodium phosphate buffer at pH 5.5 and 25 °C unless noted. Circular dichroism spectra were
acquired on an Aviv 62DS spectrometer equipped with a thermoelectric unit. Peptide concentration was approximately
20 μM. Thermal melts were monitored at 218 nm using 2° increments with an equilibration time of 120 s. Tm's were
defined as the maxima of the derivative of the melting curve. Reversibility for each of the proteins was confirmed by
comparing room temperature CD spectra from before and after heating. Guanidinium chloride denaturation measurements
followed published methods (Pace, Methods Enzymol. 131:266 (1986)). Protein concentrations were determined
by UV spectrophotometry. Fluorescence experiments were performed on a Hitachi F-4500 in a 1 cm pathlength
cell. Both peptide and ANS concentrations were 50 μM. The excitation wavelength was 370 nm and emission was
monitored from 400 to 600 nm. Size exclusion chromatography was performed with a PolyLC hydroxyethyl A column
at pH 5.5 in 50 mM sodium phosphate at 0 °C. Ribonuclease A, carbonic anhydrase and Gβ1 were used as molecular
weight standards. Peptide concentrations during the separation were ~15 μM as estimated from peak heights
monitored at 275 nm.
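
The definition of Tm as the maximum of the derivative of the melting curve can be illustrated numerically. The sketch below is not the analysis software used in this work; the function name `melting_temperature` and the synthetic two-state curve are assumptions made for illustration, and the absolute value of the derivative is taken so that the sign convention of the ellipticity does not matter.

```python
import numpy as np


def melting_temperature(temps_c, signal):
    """Estimate Tm as the temperature where |d(signal)/dT| is largest.

    temps_c: temperatures in degrees C (for example, 2 degree increments).
    signal:  CD signal at each temperature (for example, ellipticity at 218 nm).
    """
    temps = np.asarray(temps_c, dtype=float)
    values = np.asarray(signal, dtype=float)
    derivative = np.gradient(values, temps)  # finite-difference derivative
    return float(temps[np.argmax(np.abs(derivative))])


# Synthetic (made-up) two-state melt sampled every 2 degrees, midpoint at 65 C.
temps = np.arange(1.0, 100.0, 2.0)
signal = -14000.0 / (1.0 + np.exp((temps - 65.0) / 4.0))
tm = melting_temperature(temps, signal)
```

Because the derivative is evaluated only at the sampled temperatures, the estimate is limited to the 2° spacing of the melt.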

[0232] Nuclear magnetic resonance spectroscopy. Samples were prepared in 90/10 H2O/D2O and 50 mM sodium
phosphate buffer at pH 5.5. Spectra were acquired on a Varian Unityplus 600 MHz spectrometer at 25 °C. Samples
were approximately 1 mM, except for α70 which had limited solubility (100 μM). For hydrogen exchange studies, an
NMR sample was prepared, the pH was adjusted to 5.5 and a spectrum was acquired to serve as an unexchanged
reference. This sample was lyophilized, reconstituted in D2O and repetitive acquisition of spectra was begun
immediately at a rate of 75 s per spectrum. Data acquisition continued for 20 hours, then the sample was heated to 99 °C
for three minutes to fully exchange all protons. After cooling to 25 °C, a final spectrum was acquired to serve as the
fully exchanged reference. The areas of all exchangeable amide peaks were normalized by a set of non-exchanging
aliphatic peaks. pH values, uncorrected for isotope effects, were measured for all the samples after data acquisition






Thermal melts monitored by CD are shown in Figure [...] temperatures (Tm's) of 83 °C and 92 °C, respectively. α107
[...] fully unfolded polypeptide, and α70 has a [...]

[0235] The extent of chemical shift dispersion in the [...] NMR spectrum of each protein was assessed to gauge
each protein's degree of native-like character (data not shown). [...] hallmark of a well-ordered native protein.
[...] dispersion and peaks that are somewhat broadened relative to α90, suggesting a moderately [...] NMR spectrum
has almost no dispersion. [...] collapsed but disordered and fluctuating structure. α107 has a spectrum with [...]
protein. Amide hydrogen exchange kinetics are consistent with the [...] of the NMR spectra. Measuring the average
number [...] of designed proteins results as follows (data not shown): [...] 5.5 and 25 °C. The α90 exchange curve
is indistinguishable from Gβ1's [...] well-protected set of amide protons, a distinctive feature of [...] protons,
only about half that of α90. The difference is [...]

[...] binding were used to [...] for proteins with aromatic residues fixed in a unique tertiary structure [...]
spectra indicative of proteins with mobile aromatic residues such as [...] α70 also binds ANS well, as indicated by
a three-fold [...]. This strong binding suggests that α70 possesses a [...] hydrophobic residues accessible to ANS.
ANS binds α85 weakly, with only a [...] seen for some native proteins (Semisotnov, et al., [...]) [...] cause no
change in ANS fluorescence. All of the proteins migrated as monomers [...] size exclusion chromatography.

[0237] In summary, α90 is a well-[...] naturally occurring Gβ1 sequence [...] protein, albeit with greater motional
flexibility than α90 as [...] behavior. α70 has all the features of a disordered collapsed [...] spectral dispersion
or amide proton protection [...]. α107 is a completely unfolded chain, likely due to its [...]. The [...] trend is a
loss of protein ordering as α decreases [...]

[...] well-ordered proteins. In regime [...], with 0.8 ≤ α < 0.9, packing forces are [...] which produces a stable
well-packed protein [...] packing forces are reduced to such an extent that the hydrophobic force [...] a partially
folded structure with no stable core packing. In regime [...] packing specificity are scaled too high to allow
reasonable sequence selection [...]

[...] should be designed with the smallest α that still results in [...] and well packed, suggesting 0.8 ≤ α < 0.9
[...], however, clearly show that α85 is not as structurally ordered as α90. [...] for W43 in α85 and α90 present a
possible explanation for [...] same conformation as in the crystal structure of Gβ1. In α85 [...]
α85, the larger side chains at positions 34 and 54, leucine and






phenylalanine respectively, compared to alanine and valine in α90, force W43 to expose 91 Å² of nonpolar surface
compared to 19 in α90. The hydrophobic driving force this exposure represents seems likely to stabilize alternate
conformations that bury W43 and thereby could contribute to α85's conformational flexibility (Dill, 1985; Onuchic, et
al., 1996). In contrast to the other core positions, a residue at position 43 can be mostly exposed or mostly buried
depending on its side-chain conformation. We designate positions with this characteristic as boundary positions, which
pose a difficult problem for protein design because of their potential to interact strongly either with the protein's core
or with solvent.

[0240] A scoring function that penalizes the exposure of hydrophobic surface area might assist in the design of
boundary residues. Dill and coworkers used an exposure penalty to improve protein designs in a theoretical study
(Sun, et al., Protein Eng. 8(12):1205-1213 (1995)).

[0241] A nonpolar exposure penalty would favor packing arrangements that either bury large side chains in the core
or replace the exposed amino acid with a smaller or more polar one. We implemented a side-chain nonpolar exposure
penalty in our optimization framework and used a penalizing solvation parameter with the same magnitude as the
hydrophobic burial parameter.
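
A penalty of this kind can be sketched as a linear term in the exposed nonpolar area. The code below is an illustration under stated assumptions, not the scoring function of the design program: the function name `solvation_score`, the linear form, and the unit value of the solvation parameter are all hypothetical. Only the idea that the penalty parameter has the same magnitude as the burial parameter is taken from the text.

```python
def solvation_score(buried_nonpolar_area, exposed_nonpolar_area,
                    sigma_burial, penalty_ratio=1.0):
    """Illustrative solvation score (lower is more favorable).

    Burial of nonpolar area is rewarded with -sigma_burial per unit area.
    Exposure of nonpolar area is penalized with +penalty_ratio * sigma_burial
    per unit area; penalty_ratio = 1.0 gives a penalty of equal magnitude.
    """
    benefit = -sigma_burial * buried_nonpolar_area
    penalty = penalty_ratio * sigma_burial * exposed_nonpolar_area
    return benefit + penalty


# Hypothetical comparison: equal burial, but W43 exposing 91 versus 19 square
# angstroms of nonpolar surface (the exposures quoted for the two designs).
sigma = 1.0  # arbitrary units; the actual parameter value is not given here
more_exposed = solvation_score(100.0, 91.0, sigma)
less_exposed = solvation_score(100.0, 19.0, sigma)
```

Under this form, the arrangement that exposes more nonpolar surface scores worse by the solvation parameter times the difference in exposed area.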

[0242] The results of adding a hydrophobic surface exposure penalty to our scoring function are shown in Table 7. 



Table 7.

α = 0.85

Gβ1 residue:  TYR LEU LEU ALA ALA PHE ALA VAL TRP PHE VAL
Position:       3   5   7  20  26  30  34  39  43  52  54

 #    Anp    Entries as printed (column alignment not recoverable)
 1    109    PHE ILE LEU ILE | TRP PHE
 2    109    | | ILE | | | LEU ILE | TRP PHE
 3    [...]  PHE ILE LEU ILE | | PHE
 4    104    | ILE LEU ILE | | PHE
 5    108    PHE ILE LEU | | TRP PHE
 6     62    PHE ILE LEU ILE VAL TRP PHE
 7    103    PHE ILE LEU ILE | TYR PHE
 8    109    PHE VAL LEU ILE | TRP PHE
 9     30    PHE ILE | ILE | | |
10     38    PHE ILE | ILE | TRP |
11    108    | ILE LEU | | TRP PHE
12     62    | ILE LEU ILE VAL TRP PHE
13    109    PHE ILE TYR LEU ILE | TRP PHE
14    103    | ILE | LEU ILE | TYR PHE




[...] α = 0.85 [...] exposure [...] area because they bury W43 in a conformation [...] (model not shown). The
exceptions are the 8th and 14th sequences, which reduce the size of the exposed [...], and the 13th best sequence,
which replaces W43 with [...] structural order. In contrast, when α = 0.90, the optimal sequence does not change
and the [...] expose very little surface area. Burying W43 restricts sequence [...] forces for α = 0.85 still
produce [...] complements the use of reduced packing specificity by limiting the gross [...] boundary is disrupted.
Adding this constraint [...] characterized the 13th best sequence of the α = 0.85 [...]



Table 8.

α = 0.85, exposure penalty

Gβ1 residue:  TYR LEU LEU ALA ALA PHE ALA VAL TRP PHE VAL
Position:       3   5   7  20  26  30  34  39  43  52  54

 #    Anp    Entries as printed (column alignment not recoverable)
[...]
11    109    | ILE | | LEU ILE | TRP PHE
12     38    PHE ILE | | | ILE | TRP ILE
13     62    PHE ILE | | LEU ILE VAL TRP PHE
14     52    | ILE | | LEU ILE ILE | PHE
15     30    PHE ILE | | | ILE | TYR ILE


[0246] Table 8 depicts the 15 best sequences of the core positions of Gβ1 using α = 0.85 with an exposure penalty.
Anp is the exposed nonpolar surface area in Å².

[0247] This sequence, α85W43V, replaces W43 with a valine but is otherwise identical to α85. Though the 8th and
14th sequences also have a smaller side chain at position 43, additional changes in their sequences relative to α85
would complicate interpretation of the effect of the boundary position change. Also, α85W43V has a significantly
different packing arrangement compared to Gβ1, with 7 out of 11 positions altered, but only an 8% increase in side-chain
volume. Hence, α85W43V is a test of the tolerance of this fold to a different, but nearly volume conserving, core. The
far UV CD spectrum of α85W43V is very similar to that of Gβ1, with an ellipticity at 218 nm of -14000 deg cm²/dmol.
While the secondary structure content of α85W43V is native-like, its Tm is 65 °C, nearly 20 °C lower than α85. In
contrast to α85W43V's decreased stability, its NMR spectrum has greater chemical shift dispersion than α85 (data not
shown). The amide hydrogen exchange kinetics show a well protected set of about four protons after 20 hours (data
not shown). This faster exchange relative to α85 is explained by α85W43V's significantly lower stability (Mayo &
Baldwin, 1993). α85W43V appears to have improved structural specificity at the expense of stability, a phenomenon
observed previously in coiled coils (Harbury, et al., 1993). By using an exposure penalty, the design algorithm produced
a protein with greater native-like character.

[0248] We have quantitatively defined the role of packing specificity in protein design and have provided practical
bounds for the role of steric forces in our protein design program. This study differs from previous work because of the
use of an objective, quantitative program to vary packing forces during design, which allows us to readily apply our
conclusions to different protein systems. Further, by using the minimum effective level of steric forces we were able
to design a wider variety of packing arrangements that were compatible with the given fold. Finally, we have identified
a difficulty in the design of side chains that lie at the boundary between the core and the surface of a protein and we
have implemented a nonpolar surface exposure penalty in our sequence design scoring function that addresses this
problem.

Example 5 

Design of a full protein 

[0249] The entire amino acid sequence of a protein motif has been computed. As in Example 4, the second zinc
finger module of the DNA binding protein Zif268 was selected as the design template. In order to assign the residue
positions in the template structure into core, surface or boundary classes, the orientation of the Cα-Cβ vectors was
assessed relative to a solvent accessible surface computed using only the template Cα atoms. A solvent accessible
surface for only the Cα atoms of the target fold was generated using the Connolly algorithm with a probe radius of 8.0
Å, a dot density of 10 Å⁻², and a Cα radius of 1.95 Å. A residue was classified as a core position if the distance from
its Cα, along its Cα-Cβ vector, to the solvent accessible surface was greater than 5 Å, and if the distance from its Cβ
to the nearest surface point was greater than 2.0 Å. The remaining residues were classified as surface positions if the






[...] The classifications for Zif268 were [...] boundary to the surface class to account for [...] residues in the
tertiary structure and inaccuracies in the assignment [...] remaining 20 residues were assigned to the surface.
Interestingly, [...] of Zif268 are in the boundary or core. One residue [...] Cα-Cβ vector [...] protein's geometric
center and is classified as a surface position. [...] core positions during sequence selection were A, V, L, I, F,
Y, and W; [...] were A, S, T, H, D, N, E, Q, K, and R; and the combined [...] were considered at the boundary
positions. Two of the [...] results in a virtual combinatorial library of 1.9 × 10^27 possible [...] (one core
position with 7 possible amino acids, 7 boundary positions [...] amino acids and 2 positions with φ angles greater
than 0° [...]). [...] corresponding peptide library consisting of only a single molecule for each [...] accurately
model the geometric [...] amino acid side chains in our sequence [...] of allowed conformations, called rotamers.
As above, a backbone [...] supra), with adjustments in the χ1 and χ2 [...] algorithm must consider all rotamers for
each possible amino acid at each [...] is therefore 1.1 × 10^62 possible rotamer sequences [...] optimization
problem for the ββα motif required 90 CPU hours to find the optimal sequence [...] (FSD-1) [...] Even though all of
the [...] acids. The eight core and boundary positions [...] well packed buried cluster. The Phe side chains
selected by the algorithm at the zinc binding His positions [...] at position 5 is 100% buried while the [...] The
other boundary positions demonstrate the strong steric constraints on buried residues [...] similar to that of
Zif268. [...] buries 1150 Å² of nonpolar surface area. On the helix surface, [...] hydrogen bond between its
side-chain carbonyl oxygen and the [...] the helix form three pairs of hydrogen bonds [...] appeared to be less
important than the overall helix [...] (Dahiyat, et al., Science (1997)). Positions [...] and 11 on the exposed
sheet surface were selected to be [...], 1993).

[0253] Figure 11 shows the alignment of the sequences [...] of the 28 residues (21%) are identical and only 11 (39%)
are similar. [...] consistent with the expectation that buried residues are more conserved than [...] (Bowie, et al.,
Science 247:1306-1310 (1990)). A BLAST (Altschul, et al., [...]) search [...] against the non-redundant protein
sequence database of the [...] protein sequences. Further, the BLAST search [...] matches of weak statistical
significance to fragments of various unrelated proteins. The highest [...] p values ranging from 0.63-1.0. Random 28
residue sequences [...] classification described above produced similar BLAST search matches [...] ranging from
0.35-1.0, further suggesting that the matches [...] FSD-1 [...] The very low identity to any known protein sequence
demonstrates the novelty of the FSD-1 [...]

[0254] In order to examine the robustness of the computed sequence, [...] FSD-1 was used as the starting point of a
Monte Carlo simulated annealing [...] sequences in the neighborhood of the optimal solution [...] ground-state
solution to the 1000th most stable sequence [...] about 5 [...] density of states is high. The amino acids
comprising the core of the [...] invariant (Figure 11). Almost all of the sequence variation [...] Asn 14, which is
predicted to form a [...], is among the most conserved surface positions. The strong






sequence conservation observed for critical areas of the molecule suggests that if a representative sequence folds
into the design target structure, then perhaps thousands of sequences whose variations do not disrupt the critical
interactions may be equally competent. Even if billions of sequences would successfully achieve the target fold, they
would represent only a vanishingly small proportion of the 10^27 possible sequences.
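
The size of the sequence space can be checked from the position classes and amino acid sets given in this example. The sketch below rests on two assumptions that are labeled in the code: the boundary set is taken to be the union of the core and surface sets, and the two positions with φ angles greater than 0° are taken to be fixed to a single amino acid, leaving 18 of the 20 surface positions variable. The function name `library_size` is hypothetical.

```python
CORE = set("AVLIFYW")         # amino acids considered at core positions
SURFACE = set("ASTHDNEQKR")   # amino acids considered at surface positions
BOUNDARY = CORE | SURFACE     # assumption: boundary set is the combined set


def library_size(n_core, n_boundary, n_surface, n_fixed=0):
    """Number of amino acid sequences for the given counts of positions.

    n_fixed positions are assumed to allow exactly one amino acid each,
    so they contribute a factor of one.
    """
    return (len(CORE) ** n_core
            * len(BOUNDARY) ** n_boundary
            * len(SURFACE) ** n_surface
            * 1 ** n_fixed)


# 1 core + 7 boundary + 18 variable surface + 2 fixed positions = 28 residues.
size = library_size(n_core=1, n_boundary=7, n_surface=18, n_fixed=2)
size_text = f"{size:.1e}"
```

Under these assumptions the count is 7 × 16^7 × 10^18, which is of the order of 10^27.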

[0255] Experimental validation. FSD-1 was synthesized in order to characterize its structure and assess the
performance of the design algorithm. The far UV circular dichroism (CD) spectrum of FSD-1 shows minima at 220 nm and
207 nm, which is indicative of a folded structure (data not shown). The thermal melt is weakly cooperative, with an
inflection point at 39 °C, and is completely reversible (data not shown). The broad melt is consistent with a low enthalpy
of folding, which is expected for a motif with a small hydrophobic core. This behavior contrasts with the uncooperative
thermal unfolding transitions observed for other folded short peptides (Scholtz, et al., 1991). FSD-1 is highly soluble
(greater than 3 mM) and equilibrium sedimentation studies at 100 μM, 500 μM and 1 mM show the protein to be monomeric.
The sedimentation data fit well to a single species, monomer model with a molecular mass of 3630 at 1 mM, in good
agreement with the calculated monomer mass of 3488. Also, far UV CD spectra showed no concentration dependence
from 50 μM to 2 mM, and nuclear magnetic resonance (NMR) COSY spectra taken at 100 μM and 2 mM were essentially
identical.

[0256] The solution structure of FSD-1 was solved using homonuclear 2D NMR spectroscopy (Piantini, et al.,
1982). NMR spectra were well dispersed, indicating an ordered protein structure and easing resonance assignments.
Proton chemical shift assignments were determined with standard homonuclear methods (Wuthrich, 1986). Unambiguous
sequential and short-range NOEs indicate helical secondary structure from residues 15 to 26, in agreement with
the design target.

[0257] The structure of FSD-1 was determined using 284 experimental restraints (10.1 restraints per residue) that
were non-redundant with covalent structure, including 274 NOE distance restraints and 10 hydrogen bond restraints
involving slowly exchanging amide protons. Structure calculations were performed using X-PLOR (Brünger, 1992) with
standard protocols for hybrid distance geometry-simulated annealing (Nilges, et al., FEBS Lett. 229:317 (1988)). An
ensemble of 41 structures converged with good covalent geometry and no distance restraint violations greater than
0.3 Å (Table 9).






Table 9.

Distance restraints
  Intraresidue                            [...]
  Sequential                              [...]
  Short range (|i-j| = 2-5 residues)      [...]
  Long range (|i-j| > 5 residues)         [...]
  Hydrogen bond                           [...]
  Total                                   [...]

Structural statistics
                                                <SA> ± SD          (SA)r
  Rms deviation from distance restraints (Å)    [...]              0.038
  Rms deviation from idealized geometry
    Bonds (Å)                                   0.0041 ± 0.0002    0.0037
    Angles (degrees)                            0.67 ± 0.02        0.65
    Impropers (degrees)                         0.53 ± 0.05        0.51

Atomic rms deviations (Å)*
                                                <SA> vs. SA ± SD   <SA> vs. (SA)r ± SD
  Backbone                                      0.54 ± 0.15        0.69 ± 0.16
  Backbone + nonpolar side chains†              0.99 ± 0.17        1.16 ± 0.18
  Heavy atoms                                   1.43 ± 0.20        [...] ± 0.29

*Atomic rms deviations are for residues [...]
†Nonpolar side chains are from residues 3, 5, 7, 12, 18, 21, 22, and 25, which constitute [...]
[0258] The backbone of FSD-1 is well defined with a [...] from the mean of 0.54 Å (residues 3-26). Considering the
buried side [...] 22, and 25) in addition to the backbone gives an rms deviation of 0.99 Å, indicating that the
[...]. The stereochemical quality of the ensemble of structures was examined [...] PROCHECK [...]. Not including
the disordered termini and the glycine residues, 87% [...] the [...] favored region and the remainder in the allowed
region of φ, ψ space. [...] (residues 3-6), which has an average backbone angular order parameter [...] compared to
the second strand (residues 9-12) with an <S> = [...]. Overall, FSD-1 is notably well ordered and to our knowledge
[...] consisting entirely of naturally occurring amino acids that folds to a unique structure [...] binding,
oligomerization or disulfide [...] (McKnight, et al., 1997).

[0259] The packing pattern of the hydrophobic core of the NMR [...] Phe 12, Leu 18, Phe 21, Ile 22, and Phe 25) is
similar to the [...] arrangement. Five of the seven residues have χ1 angles in the same gauche+, gauche- or trans
[...] χ1 and χ2 angles. The two residues that do not [...], which is consistent with their location at the less
constrained, open end [...] extensive packing interactions and instead exposes about 45% of its [...] relative to
the design template. Conversely, [...] (60%) and χ1 and χ2 angles matching the computed structure [...] examination
of the predicted surface [...]




from its side-chain carbonyl oxygen as predicted, but to the amide of Glu 17, not Lys 16 as expected from the design.
This hydrogen bond is present in 95% of the structure ensemble and has a donor-acceptor distance of 2.6 ± 0.06 Å.
In general, the side chains of FSD-1 correspond well with the design program predictions.

[0260] A comparison of the average restrained minimized structure of FSD-1 and the design target was done (data
not shown). The overall backbone rms deviation of FSD-1 from the design target is 1.98 Å for residues 3-26 and only
0.98 Å for residues 8-26 (Table 10).
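
A backbone rms deviation between two structures is commonly computed after optimal rigid-body superposition. The sketch below uses the Kabsch algorithm for that purpose; whether the values quoted above were obtained with this exact procedure is an assumption, and the function name `superposed_rmsd` is hypothetical. Both inputs are arrays of matched atom coordinates, for example the backbone atoms of residues 8-26 in each structure.

```python
import numpy as np


def superposed_rmsd(coords_a, coords_b):
    """Rms deviation between matched atoms after optimal superposition (Kabsch)."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape or a.ndim != 2 or a.shape[1] != 3:
        raise ValueError("inputs must both have shape (n_atoms, 3)")
    # Remove translation by centering both coordinate sets.
    a_centered = a - a.mean(axis=0)
    b_centered = b - b.mean(axis=0)
    # Optimal rotation from the SVD of the covariance matrix.
    u, _, vt = np.linalg.svd(a_centered.T @ b_centered)
    sign = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0  # avoid reflections
    rotation = vt.T @ np.diag([1.0, 1.0, sign]) @ u.T
    diff = a_centered @ rotation.T - b_centered
    return float(np.sqrt((diff ** 2).sum() / len(a)))
```

Restricting the input arrays to different residue ranges gives the kind of per-region comparison reported in the paragraph above.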



Table 10.

Comparison of the FSD-1 experimentally determined structure and the design target structure. The FSD-1 structure
is the restrained energy minimized average from the NMR structure determination. The design target structure is
the second DNA binding module of the zinc finger Zif268 (9).

Atomic rms deviations (Å)
  Backbone, residues 3-26       1.98
  Backbone, residues 8-26       0.98

Super-secondary structure parameters*
                    FSD-1     Design Target
  h (Å)             9.9       8.9
  θ (degrees)       14.2      16.5
  Ω (degrees)       13.1      13.5

*h, θ, Ω are calculated as previously described (36, 37). h is the distance between the centroid of the helix Cα
coordinates (residues 15-26) and the least-square plane fit to the Cα coordinates of the sheet (residues 3-12). θ is
the angle of inclination of the principal moment of the helix Cα atoms with the plane of the sheet. Ω is the angle
between the projection of the principal moment of the helix onto the sheet and the projection of the average
least-square fit line to the strand Cα coordinates (residues 3-6 and 9-12) onto the sheet.

[0261] The largest difference between FSD-1 and the target structure occurs from residues 4-7, with a displacement
of 3.0-3.5 Å of the backbone atom positions of strand 1. The agreement for strand 2, the strand to helix turn, and the
helix is remarkable, with the differences nearly within the accuracy of the structure determination. For this region of
the structure, the rms difference of φ, ψ angles between FSD-1 and the design target is only 14 ± 9°. In order to
quantitatively assess the similarity of FSD-1 to the global fold of the target, we calculated their supersecondary structure
parameters (Table 10) (Janin & Chothia, J. Mol. Biol. 143:95 (1980); Su & Mayo, Protein Sci. in press, 1997), which
describe the relative orientations of secondary structure units in proteins. The values of θ, the inclination of the helix
relative to the sheet, and Ω, the dihedral angle between the helix axis and the strand axes, are nearly identical. The
height of the helix above the sheet, h, is only 1 Å greater in FSD-1. A study of protein core design as a function of helix
height for Gβ1 variants demonstrated that up to 1.5 Å variation in helix height has little effect on sequence selection
(Su & Mayo, supra, 1997). The comparison of secondary structure parameter values and backbone coordinates
highlights the excellent agreement between the experimentally determined structure of FSD-1 and the design target, and
demonstrates the success of our algorithm at computing a sequence for this ββα motif.
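
Two of the supersecondary structure parameters, the helix height h and the inclination θ, can be sketched from Cα coordinates alone. The code below is an illustration, not the procedure of the cited references: it assumes that the helix axis can be approximated by the direction of greatest extent of the helix Cα atoms and that the sheet plane is the least-squares plane through the sheet Cα atoms, both obtained from a singular value decomposition. The function names are hypothetical, and Ω is omitted.

```python
import numpy as np


def _principal_directions(points):
    """Return the centroid and the principal directions (rows, largest spread first)."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    return centroid, vt


def helix_height_and_inclination(helix_ca, sheet_ca):
    """Return (h, theta): helix height above the sheet plane and inclination in degrees."""
    helix = np.asarray(helix_ca, dtype=float)
    sheet = np.asarray(sheet_ca, dtype=float)
    sheet_centroid, sheet_dirs = _principal_directions(sheet)
    normal = sheet_dirs[-1]      # least-squares plane normal (smallest spread)
    helix_centroid, helix_dirs = _principal_directions(helix)
    axis = helix_dirs[0]         # assumed helix axis (largest spread)
    h = abs(float(np.dot(helix_centroid - sheet_centroid, normal)))
    sin_theta = min(1.0, abs(float(np.dot(axis, normal))))
    theta = float(np.degrees(np.arcsin(sin_theta)))
    return h, theta
```

For an idealized helix axis lying 9 Å above a flat sheet and tilted by 14°, the function returns h = 9 and θ = 14, which is the kind of comparison summarized in Table 10.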

[0262] The quality of the match between FSD-1 and the design target demonstrates the ability of our program to
design a sequence for a fold that contains the three major secondary structure elements of proteins: sheet, helix, and
turn. Since the ββα fold is different from those used to develop the sequence selection methodology, the design of
FSD-1 represents a successful transfer of our program to a new motif.

Example 6 

Calculation of solvent accessible surface area scaling factors 

[0263] In contrast to the previous work, backbone atoms are included in the calculation of surface areas. Thus, the
calculation of the scaling factors proceeds as follows.

[0264] The program BIOGRAF (Molecular Simulations Incorporated, San Diego, California) was used to generate
explicit hydrogens on the structures, which were then conjugate gradient minimized for 50 steps using the DREIDING
force field. Surface areas were calculated using the Connolly algorithm with a dot density of 10 Å⁻², using a probe
radius of zero and an add-on radius of 1.4 Å and atomic radii from the DREIDING force field. Atoms that contribute to
the hydrophobic surface area are carbon, sulfur and hydrogen atoms attached to carbon and sulfur.
[0265] For each side-chain rotamer r at residue position i with a local tri-peptide backbone t3, we calculated
$A^0_{i_r t3}$, the exposed area of the rotamer and its backbone in the presence of the local tri-peptide backbone,
and $A_{i_r t}$, the exposed area of the rotamer and its backbone in the presence of the entire template t, which
includes the protein backbone and any side chains not involved in the calculation (Figure 13). The difference
between $A^0_{i_r t3}$ and $A_{i_r t}$ is the total area buried by the template for a rotamer r at residue position
i. For each pair of residue positions i and j and rotamers r and s on i and j respectively, $A_{i_r j_s t}$, the
exposed area of the pair of rotamers in the presence of the entire template, is calculated. The difference between
$A_{i_r j_s t}$ and the sum of $A_{i_r t}$ and $A_{j_s t}$ is the area buried between residues i and j, excluding
that area by the template. The pairwise approximation to the total buried surface area is:

Equation 29:

$$A_{\mathrm{buried}} = \sum_i \left( A^0_{i_r t3} - A_{i_r t} \right) + f \sum_{i<j} \left( A_{i_r t} + A_{j_s t} - A_{i_r j_s t} \right)$$

[0267] Noting that the buried and exposed areas should add to the total area $\sum_i A^0_{i_r t3}$, the
solvent-exposed surface area is:

Equation 30:

$$A_{\mathrm{exposed}} = \sum_i A_{i_r t} - f \sum_{i<j} \left( A_{i_r t} + A_{j_s t} - A_{i_r j_s t} \right)$$

[...]

$$f = \frac{\text{true buried area}}{\text{pairwise buried area}}$$

and noting that each sphere has 12 neighbors, results in:

$$f = \frac{4\pi R^2}{12 \times 2\pi R\,(R - r)}$$

[...] packing is exaggerated. Therefore this value of f should be [...] the packing fraction is lower, a somewhat
larger value of f [...]

[...] of each side chain's Cα-Cβ vector relative to the surface computed using only [...] Cα atoms with a carbon
radius of 1.95 Å, a probe radius of 8 Å and [...] Cα atom (along its Cα-Cβ vector) to the surface was greater [...]
Cβ atom to the nearest point on the surface was greater than 2.0 Å. [...] type actually present at each residue
position [...] proteins, total number of residues and the number of residues in the core and non-core of each
protein (Gly and Pro were not considered).






Brookhaven Identifier    Total Size    Core Size    Non-Core Size
1enh                      54           10            40
1pga                      56           10            40
1ubi                      76           16            50
1mol                      94           19            61
1kpt                     105           27            60
4azu-A                   128           39            71
1gpr                     158           39            89
1gcs                     174           53            98
1edt                     266           95           133
1pbn                     289           96           143
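
The core criterion used for this classification reduces to two distance thresholds once the Cα-only surface has been computed. The sketch below assumes the two distances are already available for each residue (computing the Connolly surface itself is outside its scope); the function name `is_core_position` and the argument names are hypothetical, while the thresholds of 5 Å and 2.0 Å are the ones stated in this specification.

```python
def is_core_position(dist_ca_along_cb_to_surface, dist_cb_to_nearest_surface,
                     ca_threshold=5.0, cb_threshold=2.0):
    """Return True if a residue meets both core criteria (distances in angstroms).

    dist_ca_along_cb_to_surface: distance from the C-alpha atom, measured along
        the C-alpha to C-beta vector, to the C-alpha-only accessible surface.
    dist_cb_to_nearest_surface: distance from the C-beta atom to the nearest
        point on that surface.
    """
    return (dist_ca_along_cb_to_surface > ca_threshold
            and dist_cb_to_nearest_surface > cb_threshold)


# Hypothetical distances for three residues, for illustration only.
examples = {"res_a": (6.3, 2.8), "res_b": (6.3, 1.5), "res_c": (3.9, 2.8)}
labels = {name: ("core" if is_core_position(*d) else "non-core")
          for name, d in examples.items()}
```

Residues that fail either test are treated here simply as non-core, which is the two-way split used for fitting the scaling factors.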



[0272] The classification into core and non-core was made because core residues interact more strongly with one
another than do non-core residues. This leads to greater over-counting of the buried surface area for core residues.
[0273] Considering the core and non-core cases separately, the value of f which most closely reproduced the true
Lee and Richards surface areas was calculated for the ten proteins. The pairwise approximation very closely matches
the true buried surface area (data not shown). It also performs very well for the exposed hydrophobic surface area of
non-core residues (data not shown). The calculation of the exposed surface area of the entire core of a protein involves
the difference of two large and nearly equal areas and is less accurate; as will be shown, however, when there is a
mixture of core and non-core residues, a high accuracy can still be achieved. These calculations indicate that for core
residues f is 0.42 and for non-core residues f is 0.79.

[0274] To test whether the classification of residues into core and non-core was sufficient, we examined subsets of
interacting residues in the core and non-core positions, and compared the true buried area of each subset with that
calculated (using the above values of f). For both subsets of the core and the non-core, the correlation remained high
(R² = 1.00), indicating that no further classification is necessary (data not shown). (Subsets were generated as follows:
given a seed residue, a subset of size two was generated by adding the closest residue; the next closest residue was
added for a subset of size three, and this was repeated up to the size of the protein. Additional subsets were generated
by selecting different seed residues.)
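
The subset construction described in parentheses can be sketched as follows. The sketch reads "closest" as closest to the seed residue, using one representative coordinate per residue with distinct positions; that reading, the choice of representative coordinate, and the function name `growing_subsets` are assumptions made for illustration.

```python
import numpy as np


def growing_subsets(coords, seed):
    """Return nested residue subsets of size 2, 3, ..., N grown around a seed.

    coords: one representative 3D coordinate per residue (assumed distinct).
    seed:   index of the seed residue.
    Residues are added in order of increasing distance from the seed.
    """
    points = np.asarray(coords, dtype=float)
    distances = np.linalg.norm(points - points[seed], axis=1)
    order = np.argsort(distances, kind="stable")  # the seed comes first (distance 0)
    return [order[:size].tolist() for size in range(2, len(order) + 1)]


# Four hypothetical residues on a line; seed is residue 1.
toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
subsets = growing_subsets(toy_coords, seed=1)
```

Calling the function with each residue in turn as the seed gives the additional subsets mentioned in the text.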

[0275] It remains to apply this approach to calculating the buried or exposed surface areas of an arbitrary selection
of interacting core and non-core residues in a protein. When a core residue and a non-core residue interact, we replace
Equation 29 with:

Equation 31:

$$A_{\mathrm{buried}} = \sum_i \left( A^0_{i_r t3} - A_{i_r t} \right) + \sum_{i<j} f_{ij} \left( A_{i_r t} + A_{j_s t} - A_{i_r j_s t} \right)$$

and Equation 30 with Equation 32:

$$A_{\mathrm{exposed}} = \sum_i A_{i_r t} - \sum_{i<j} f_{ij} \left( A_{i_r t} + A_{j_s t} - A_{i_r j_s t} \right)$$

where $f_i$ and $f_j$ are the values of f appropriate for residues i and j, respectively, and $f_{ij}$ takes on an
intermediate value. Using subsets from the whole of 1pga, the optimal value of $f_{ij}$ was found to be 0.74. This
value was then shown to be appropriate for other test proteins (data not shown).
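
The pairwise bookkeeping can be sketched in a few lines. The code below assumes a pairwise form in which each pair overlap (the sum of the two single-residue exposed areas minus the exposed area of the pair) is scaled by a factor chosen from the residue classes: 0.42 for two core residues, 0.79 for two non-core residues and 0.74 for a mixed pair, as quoted in the text. The function names and the input layout are hypothetical, and the areas in the usage example are made-up numbers.

```python
from itertools import combinations

F_CORE, F_NONCORE, F_MIXED = 0.42, 0.79, 0.74  # scale factors quoted in the text


def scale_factor(core_i, core_j):
    """Pick the scale factor for a residue pair from the two core flags."""
    if core_i and core_j:
        return F_CORE
    if not core_i and not core_j:
        return F_NONCORE
    return F_MIXED


def pairwise_areas(a0, a_t, a_pair, is_core):
    """Return (buried, exposed) areas under the assumed pairwise form.

    a0[i]:         exposed area of rotamer i with only its local tri-peptide backbone
    a_t[i]:        exposed area of rotamer i in the presence of the template
    a_pair[(i,j)]: exposed area of the pair i < j in the presence of the template
    is_core[i]:    True if residue i is classified as core
    """
    n = len(a0)
    buried = sum(a0[i] - a_t[i] for i in range(n))
    exposed = float(sum(a_t))
    for i, j in combinations(range(n), 2):
        overlap = a_t[i] + a_t[j] - a_pair[(i, j)]  # area buried between i and j
        f_ij = scale_factor(is_core[i], is_core[j])
        buried += f_ij * overlap
        exposed -= f_ij * overlap
    return buried, exposed


# Made-up areas for three residues (two core, one non-core).
a0 = [100.0, 120.0, 90.0]
a_t = [60.0, 80.0, 50.0]
a_pair = {(0, 1): 130.0, (0, 2): 110.0, (1, 2): 125.0}
buried, exposed = pairwise_areas(a0, a_t, a_pair, [True, True, False])
```

By construction the buried and exposed areas sum to the total of the isolated areas, which is the consistency condition used to pass from the buried-area expression to the exposed-area expression.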



Claims 



1 . A method executed by a computer under the control of a program, said computer including a memory for storing 
said program, said method comprising the steps of: 



45 



EP 0 974 111 B1 



(A) receiving a protein backbone structure with variable residue positions- 

C ^""^ '^'^"'^ ^ boundary residue- 

(D) analyzing the interaction of each of the said rotamers with all or Dart nf ihT!^ . 
^a.e^a^.ofop«.i.edp.^^^^ 

2. -men^thodaccordngtodaim 1 • wherein at leas, one variable residue position comprises a surface or bour^^ 

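3. The method according to claim 1, wherein said analyzing step comprises a DEE computation.

4. The method according to claim 1, wherein said set of optimized protein sequences comprises the globally optimal
protein sequence.

5. The method according to claim 1, wherein said DEE computation is selected from the group consisting of original
DEE and Goldstein DEE.

6. The method according to claim 1, wherein said scoring function is selected from the group consisting of a van der
Waals potential scoring function, a hydrogen bond potential scoring function, an atomic solvation scoring function,
an electrostatic scoring function and a secondary structure propensity scoring function.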

7. The method according to claim 1, wherein said analyzing step includes the use of at least three scoring functions.

8. The method according to claim 1, wherein said analyzing step includes the use of at least four scoring functions.

9. The method according to claim 1, further comprising testing at least one member of said set to produce
experimental results.

10. The method according to claim 4, further comprising:

(D) generating a rank order list of additional optimal sequences from said globally optimal protein sequence.

11. The method according to claim 10, wherein said generating includes the use of a Monte Carlo search.

12. The method according to claim 1, wherein said analyzing step comprises a Monte Carlo search.

13. The method according to claim 10, further comprising:

(E) testing some or all of said protein sequences from said order list to produce potential energy test results.

14. The method according to claim 13, further comprising:

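(F) analyzing the correspondence between said potential energy test results and theoretical potential energy
data.

15. A computer readable memory embodying a program, said program comprising code means that, when they are
executed by a computer, direct:

a side chain module to correlate a group of potential rotamers for residue positions of a protein backbone
model classified as a core, surface or boundary residue,

a ranking module comprising at least two scoring function components to analyze the interaction of each of
said rotamers with all or part of the remainder of said protein to generate a set of optimized protein sequences.

16. The computer readable memory according to claim 15, wherein said scoring component includes a van der Waals
scoring function.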




17. The computer readable memory according to claim 15, wherein said scoring component includes an atomic
solvation scoring function.

18. The computer readable memory according to claim 15, wherein said scoring component includes a hydrogen bond
scoring function.

19. The computer readable memory according to claim 15, wherein said scoring component includes a secondary
structure scoring function.

20. The computer readable memory according to claim 15, further comprising an assessment module to assess the
correspondence between potential energy test results and theoretical potential energy data.



Patentansprüche

1. Verfahren, das von einem Computer unter der Kontrolle eines Programms durchgeführt wird, wobei der Computer
einen Speicher zum Speichern des Programms umfasst, wobei das Verfahren folgende Schritte umfasst:

(A) den Erhalt einer Protein-Hauptkettenstruktur mit variablen Restepositionen;

(B) das Klassifizieren jeder variablen Resteposition entweder als Kern-, Oberflächen- oder Randrest;

(C) das Ermitteln einer Gruppe potentieller Rotamere für jede der variablen Restepositionen, worin zumindest
eine variable Resteposition Rotamere von zumindest zwei verschiedenen Aminosäureseitenketten aufweist;
und

(D) das Analysieren der Wechselwirkung jedes der Rotamere mit dem gesamten oder einem Teil des übrigen
Proteins, um einen Satz optimierter Proteinsequenzen zu bilden, worin der Schritt des Analysierens die
Verwendung zumindest einer Auswertungsfunktion umfasst.

2. Verfahren nach Anspruch 1, worin zumindest eine variable Resteposition einen Oberflächen- oder Randrest
umfasst.

3. Verfahren nach Anspruch 1, worin der Schritt des Analysierens eine DEE-Berechnung umfasst.

4. Verfahren nach Anspruch 1, worin der Satz optimierter Proteinsequenzen die global optimale Proteinsequenz
umfasst.

5. Verfahren nach Anspruch 1, worin die DEE-Berechnung aus der aus Original-DEE und Goldstein-DEE bestehenden
Gruppe ausgewählt ist.

6. Verfahren nach Anspruch 1, worin die Auswertungsfunktion aus der Gruppe ausgewählt ist, die aus
Van der Waals-Potential-Auswertungsfunktion, einer Wasserstoffbrückenbindungs-Potential-Auswertungsfunktion,
einer Atom-Solvatisierungs-Auswertungsfunktion, einer elektrostatischen Auswertungsfunktion und einer
Sekundärstrukturneigungs-Auswertungsfunktion besteht.

7. Verfahren nach Anspruch 1, worin der Analyseschritt den Einsatz von zumindest drei Auswertungsfunktionen
umfasst.

8. Verfahren nach Anspruch 1, worin der Analyseschritt den Einsatz von zumindest vier Auswertungsfunktionen
umfasst.

9. Verfahren nach Anspruch 1, das weiters das Testen zumindest eines Elements aus dem Satz umfasst, um
Versuchsergebnisse zu erzielen.

10. Verfahren nach Anspruch 4, weiters umfassend:

(D) das Erzeugen einer Rangordnungsliste weiterer optimaler Sequenzen aus der global optimalen
Proteinsequenz.

11. Verfahren nach Anspruch 10, worin das Erzeugen den Einsatz einer Monte Carlo-Suche umfasst.




12. Verfahren nach Anspruch 1, worin der Analyseschritt eine Monte Carlo-Suche umfasst.

13. Verfahren nach Anspruch 10, weiters umfassend:

(E) das Testen einiger der oder aller Proteinsequenzen aus der Rangordnungsliste, um Testergebnisse der
potentiellen Energie zu erzeugen.

14. Verfahren nach Anspruch 13, weiters umfassend:

[...] klassifiziert werden;

ein Reihungsmodul, das zumindest [...] analysieren, um einen Satz optimierter Proteinsequenzen zu erzeugen.



Revendications 



[...] protéine, où ladite étape d'analyse comprend l'utilisation d'au moins une fonction de marquage.

3. Méthode selon la revendication 1, où ladite étape d'analyse comprend un calcul DEE.

4. Méthode selon la revendication 1, où ledit groupe de séquences optimisées de protéine comprend la séquence
globalement optimale de la protéine.

5. Méthode selon la revendication 1, où ledit calcul DEE est sélectionné dans le groupe consistant en DEE original
et DEE Goldstein.

6. Méthode selon la revendication 1, où ladite fonction de marquage est sélectionnée dans le groupe consistant en
fonction de marquage de potentiel de Van der Waals, fonction de marquage de potentiel de liaison hydrogène,
fonction de marquage de solvatation atomique, fonction de marquage électrostatique et fonction de marquage de
tendance de structure secondaire.

7. Méthode selon la revendication 1, où ladite étape d'analyse comprend l'utilisation d'au moins trois fonctions de
marquage.

8. Méthode selon la revendication 1, où ladite étape d'analyse comprend l'utilisation d'au moins quatre fonctions de
marquage.

9. Méthode selon la revendication 1, comprenant de plus le test d'au moins un membre dudit groupe pour produire
des résultats expérimentaux.

10. Méthode selon la revendication 4, comprenant de plus :

(D) la production d'une liste d'ordres de rang de séquences optimales additionnelles à partir de ladite séquence
globalement optimale de la protéine.

11. Méthode selon la revendication 10, où ladite production comprend l'utilisation d'une recherche de Monte Carlo.

12. Méthode selon la revendication 1, où ladite étape d'analyse comprend une recherche de Monte Carlo.

13. Méthode selon la revendication 10, comprenant de plus :

(E) le test de certaines ou de la totalité des séquences de la protéine à partir de ladite liste d'ordres pour
produire des résultats de test d'énergie de potentiel.

14. Méthode selon la revendication 13, comprenant de plus :

(F) l'analyse de la correspondance entre les résultats de test d'énergie de potentiel et les données théoriques
d'énergie de potentiel.

15. Mémoire lisible à l'ordinateur mettant en oeuvre un programme, ledit programme comprenant des moyens de code
qui, quand ils sont exécutés par un ordinateur, dirigent :

un module de chaîne latérale pour mettre en corrélation un groupe de rotamères potentiels pour les positions
des résidus d'un modèle d'épine dorsale de protéine classé comme un résidu de coeur, de surface ou limite,

un module de rang comprenant au moins deux composants de fonction de marquage pour analyser l'interaction
de chacun desdits rotamères avec toute ou partie du restant de ladite protéine pour générer un groupe de
séquences optimisées de la protéine.

16. Mémoire lisible à l'ordinateur selon la revendication 15, où ledit composant de marquage comprend une fonction
de marquage de Van der Waals.

17. Mémoire lisible à l'ordinateur selon la revendication 15, où ledit composant de marquage comprend une fonction
de marquage par solvatation atomique.

18. Mémoire lisible à l'ordinateur selon la revendication 15, où ledit composant de marquage comprend une fonction
de marquage par liaison d'hydrogène.

19. Mémoire lisible à l'ordinateur selon la revendication 15, où ledit composant de marquage comprend une fonction
de marquage de structure secondaire.



[Figure: block diagram with boxes labeled PROTEIN DESIGN PROGRAM, SIDE CHAIN MODULE, RANKING MODULE, SEARCH MODULE,
ASSESSMENT MODULE, PROTEIN BACKBONE STRUCTURE, POTENTIAL ROTAMERS and PROTEIN SEQUENCES; reference numerals 20 and 24
appear in the drawing.]


[Figure: flowchart with the following numbered steps]

50  PROVIDE PROTEIN BACKBONE STRUCTURE
52  ESTABLISH GROUP OF POTENTIAL ROTAMERS
54  ANALYZE INTERACTION OF ROTAMERS WITH PROTEIN BACKBONE TO GENERATE OPTIMIZED PROTEIN SEQUENCES
56  SEARCH PROTEIN SEQUENCES
58  CONSTRUCT PROTEIN SEQUENCES




[Figure: flowchart with the following boxes, in order of appearance]

CALCULATION / STORAGE OF SINGLES AND DOUBLES ENERGIES
APPLICATION OF CUTOFF
ORIGINAL SINGLES DEE
GOLDSTEIN SINGLES DEE
ORIGINAL DOUBLES DEE
GOLDSTEIN DOUBLES DEE
ORIGINAL SUPER RESIDUE DEE
GOLDSTEIN SUPER RESIDUE DEE
CONVERGENCE AT GLOBAL OPTIMUM
MONTE CARLO SEARCH
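
To make the GOLDSTEIN SINGLES DEE box in the flowchart labels above more concrete, here is a minimal sketch of one elimination pass. It uses the Goldstein criterion as it is commonly stated in the dead-end elimination literature: rotamer r at position i is eliminated if some other rotamer t at i satisfies E(i_r) - E(i_t) + sum over j of min over s of [E(i_r, j_s) - E(i_t, j_s)] > 0. Whether this matches the exact form and ordering used in the patented program is an assumption. The function name, the data layout and the toy energies are hypothetical.

```python
def goldstein_singles_pass(self_e, pair_e):
    """One pass of Goldstein singles elimination (illustrative sketch).

    self_e: dict position -> dict rotamer -> rotamer/template energy
    pair_e: callable (i, r, j, s) -> pair energy of rotamer r at position i
            with rotamer s at position j
    Returns dict position -> set of rotamers that survive the pass.
    """
    alive = {i: set(rots) for i, rots in self_e.items()}
    for i in alive:
        for r in sorted(alive[i]):          # candidate for elimination
            for t in sorted(alive[i]):      # competitor at the same position
                if t == r:
                    continue
                gap = self_e[i][r] - self_e[i][t]
                for j in alive:
                    if j != i:
                        # Best case for r relative to t at position j.
                        gap += min(pair_e(i, r, j, s) - pair_e(i, t, j, s)
                                   for s in alive[j])
                if gap > 0:
                    # r can never beat t, so r is a dead end.
                    alive[i].discard(r)
                    break
    return alive


# Toy system: two positions with two rotamers each (made-up energies).
self_e = {0: {"a": 0.0, "b": 5.0}, 1: {"x": 0.0, "y": 0.0}}
table = {("a", "x"): 0.0, ("a", "y"): 0.0, ("b", "x"): -1.0, ("b", "y"): -2.0}


def pair_e(i, r, j, s):
    # The table is keyed (rotamer at position 0, rotamer at position 1).
    return table[(r, s)] if i < j else table[(s, r)]


survivors = goldstein_singles_pass(self_e, pair_e)
```

In this toy system rotamer "b" at position 0 is eliminated because its 5.0 self-energy penalty outweighs its best possible pair-energy advantage of 2.0; nothing is eliminated at position 1. In practice such a pass would be repeated until no further rotamers are removed.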




[Figure: two plots with x-axis SIMULATION ENERGY (kcal / mol); tick labels -120, -110, -100 and -200, -190, -180.]






[Figure: plots of [θ]/10³ (deg cm² / dmol) versus WAVELENGTH (nm), with wavelength ticks at 200, 220 and 240.]









FIG. 13C, FIG. 13D, FIG. 13E, FIG. 13F


